Re: Triple parity and beyond
On Sun, 24 Nov 2013, Stan Hoeppner wrote:
> I have always surmised that the culprit is rotational latency, because
> we're not able to get a real sector-by-sector streaming read from each
> drive.  If even only one disk in the array has to wait for the platter
> to come round again, the entire stripe read is slowed down by an
> additional few milliseconds.  For example, in an 8 drive array let's say
> each stripe read is slowed 5ms by only one of the 7 drives due to
> rotational latency, maybe acoustical management, or some other firmware
> hiccup in the drive.  This slows down the entire stripe read because we
> can't do parity reconstruction until all chunks are in.  An 8x 2TB array
> with 512KB chunk has 4 million stripes of 4MB each.  Reading 4M stripes,
> that extra 5ms per stripe read costs us
>
> (4,000,000 * 0.005)/3600 = 5.56 hours

If that is the problem then the solution would be to just enable
read-ahead.  Don't we already have that in both the OS and the disk
hardware?  The hard-drive read-ahead buffer should at least cover the
case where a seek completes but the desired sector isn't under the heads.

RAM size is steadily increasing; it seems that the smallest you can get
nowadays is 1G in a phone, and for a server the smallest is probably 4G.
On the smallest system that might have an 8 disk array you should be able
to use 512M for buffers, which allows a read-ahead of 128 chunks.

> Now consider that arrays typically have a few years on them before the
> first drive failure.  During our rebuild it's likely that some drives
> will take a few rotations to return a sector that's marginal.

Are you suggesting that it would be a common case that people just write
data to an array and never read it or do an array scrub?  I hope that it
will become standard practice to have a cron job scrubbing all
filesystems.

> So this might slow down a stripe read by dozens of milliseconds, maybe
> a full second.
> If this happens to multiple drives many times throughout the rebuild it
> will add even more elapsed time, possibly additional hours.

Have you observed such 1 second reads in practice?

One thing I've considered doing is placing a cheap disk on a speaker cone
to test vibration-induced performance problems.  Then I can use a PC to
control the level of vibration in a reasonably repeatable manner.  I'd
like to see what the limits are for retries.

Some years ago a company I worked for had some vibration problems which
dropped the contiguous read speed from about 100MB/s to about 40MB/s on
some parts of the disk (other parts gave full performance).  That was a
serious and unusual problem, and it only about halved the overall speed.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
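As an aside, the read-ahead figures in the message above are easy to sanity-check. A quick sketch (the 512 MiB buffer budget, 8-disk array, and 512 KiB chunk size are the numbers from the example; the variable names are mine):

```python
# Back-of-envelope check of the read-ahead arithmetic above. Assumptions
# taken from the message: 512 MiB of RAM used as buffer, an 8-disk array,
# 512 KiB md chunk size.
MiB = 1024 * 1024

buffer_bytes = 512 * MiB
chunk_bytes = 512 * 1024
disks = 8

chunks_in_ram = buffer_bytes // chunk_bytes    # total chunks the buffer holds
readahead_per_disk = chunks_in_ram // disks    # read-ahead depth per member

print(chunks_in_ram, readahead_per_disk)       # 1024 128
```

So 512M of buffer does indeed allow a read-ahead of 128 chunks per member drive, matching the figure quoted.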
Re: Triple parity and beyond
On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner wrote:
> Parity array rebuilds are read-modify-write operations.  The main
> difference from normal operation RMWs is that the write is always to the
> same disk.  As long as the stripe reads and chunk reconstruction outrun
> the write throughput then the rebuild speed should be as fast as a
> mirror rebuild.  But this doesn't appear to be what people are
> experiencing.  Parity rebuilds would seem to take much longer.

"This" doesn't appear to be what SOME people, who have reported issues,
are experiencing.  Their issues must be examined on a case-by-case basis.

But I, and a number of other people I have talked to or corresponded
with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
approximately the optimal sequential write speed of the replacement
drive.  It is not unusual on a reasonably configured system.

I don't know how fast the rebuilds go on the experimental RAID 5 or
RAID 6 for btrfs.
WARNING: CPU: 7 PID: 1239 at fs/btrfs/inode.c:4721 inode_tree_add+0xc2/0x13f [btrfs]()
I'm getting these with 3.13-rc1:

[53358.655620] ------------[ cut here ]------------
[53358.655686] WARNING: CPU: 7 PID: 1239 at fs/btrfs/inode.c:4721 inode_tree_add+0xc2/0x13f [btrfs]()
[53358.655779] Modules linked in: veth ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables cpufreq_ondemand cpufreq_conservative cpufreq_powersave cpufreq_stats bridge stp llc ipv6 btrfs xor raid6_pq zlib_deflate loop i2c_i801 i2c_core lpc_ich ehci_pci ehci_hcd mfd_core button video acpi_cpufreq pcspkr ext4 crc16 jbd2 mbcache raid1 sg sd_mod ahci libahci libata scsi_mod r8169 mii
[53358.656094] CPU: 7 PID: 1239 Comm: btrfs-endio-wri Not tainted 3.13.0-rc1 #1
[53358.656142] Hardware name: System manufacturer System Product Name/P8H77-M PRO, BIOS 1101 02/04/2013
[53358.656232]  0009 8803f1fedb18 81389e7d 0006
[53358.656322]  8803f1fedb58 810370a9 88053edb7e00
[53358.656411]  a02715f7 8804434aa580 8804434a3230 8807edb37800
[53358.656500] Call Trace:
[53358.656546]  [] dump_stack+0x46/0x58
[53358.656595]  [] warn_slowpath_common+0x77/0x91
[53358.656648]  [] ? inode_tree_add+0xc2/0x13f [btrfs]
[53358.656697]  [] warn_slowpath_null+0x15/0x17
[53358.656749]  [] inode_tree_add+0xc2/0x13f [btrfs]
[53358.656802]  [] btrfs_iget+0x46c/0x4b6 [btrfs]
[53358.656851]  [] ? igrab+0x40/0x41
[53358.656902]  [] relink_extent_backref+0x105/0x6cf [btrfs]
[53358.656955]  [] btrfs_finish_ordered_io+0x7bd/0x877 [btrfs]
[53358.657008]  [] finish_ordered_fn+0x10/0x12 [btrfs]
[53358.657063]  [] worker_loop+0x15e/0x495 [btrfs]
[53358.657115]  [] ? btrfs_queue_worker+0x269/0x269 [btrfs]
[53358.657165]  [] kthread+0xcd/0xd5
[53358.657211]  [] ? kthread_freezable_should_stop+0x43/0x43
[53358.657260]  [] ret_from_fork+0x7c/0xb0
[53358.657307]  [] ? kthread_freezable_should_stop+0x43/0x43
[53358.657355] ---[ end trace bf9f7dd59e43977f ]---

-- 
Tomasz Chmielewski
http://wpkg.org
Re: [PATCH] block: submit_bio_wait() conversions
On Sat, 23 Nov 2013 20:03:30 -0800 Kent Overstreet wrote:
> It was being open coded in a few places.
>
> Signed-off-by: Kent Overstreet
> Cc: Jens Axboe
> Cc: Joern Engel
> Cc: Prasad Joshi
> Cc: Neil Brown
> Cc: Chris Mason

Acked-by: NeilBrown

for the drivers/md/md.c bits, however...

> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index b6b7a28..8700de3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -776,16 +776,10 @@ void md_super_wait(struct mddev *mddev)
>  	finish_wait(&mddev->sb_wait, &wq);
>  }
>  
> -static void bi_complete(struct bio *bio, int error)
> -{
> -	complete((struct completion*)bio->bi_private);
> -}
> -
>  int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
>  		 struct page *page, int rw, bool metadata_op)
>  {
>  	struct bio *bio = bio_alloc_mddev(GFP_NOIO, 1, rdev->mddev);
> -	struct completion event;
>  	int ret;
>  
>  	rw |= REQ_SYNC;

  ^^^ you could remove this line as well, as submit_bio_wait sets this
  flag for us.

> @@ -801,11 +795,7 @@ int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
>  	else
>  		bio->bi_sector = sector + rdev->data_offset;
>  	bio_add_page(bio, page, size, 0);
> -	init_completion(&event);
> -	bio->bi_private = &event;
> -	bio->bi_end_io = bi_complete;
> -	submit_bio(rw, bio);
> -	wait_for_completion(&event);
> +	submit_bio_wait(rw, bio);
>  
>  	ret = test_bit(BIO_UPTODATE, &bio->bi_flags);
>  	bio_put(bio);

Thanks,
NeilBrown

signature.asc
Description: PGP signature
Re: Triple parity and beyond
On 11/23/2013 1:12 AM, NeilBrown wrote:
> On Fri, 22 Nov 2013 21:34:41 -0800 John Williams wrote:
>> Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
>>
>> Bottom line is that IO bandwidth is not a problem for a system with
>> prudently chosen hardware.

Quite right.

>> More likely is that you would be CPU limited (rather than bus limited)
>> in a high-parity rebuild where more than one drive failed.  But even
>> that is not likely to be too bad, since Andrea's single-threaded
>> recovery code can recover two drives at nearly 1GB/s on one of my
>> machines.  I think the code could probably be threaded to achieve a
>> multiple of that running on multiple cores.
>
> Indeed.  It seems likely that with modern hardware, the linear write
> speed would be the limiting factor for spinning-rust drives.

Parity array rebuilds are read-modify-write operations.  The main
difference from normal-operation RMWs is that the write is always to the
same disk.  As long as the stripe reads and chunk reconstruction outrun
the write throughput then the rebuild speed should be as fast as a mirror
rebuild.  But this doesn't appear to be what people are experiencing.
Parity rebuilds would seem to take much longer.

I have always surmised that the culprit is rotational latency, because
we're not able to get a real sector-by-sector streaming read from each
drive.  If even only one disk in the array has to wait for the platter to
come round again, the entire stripe read is slowed down by an additional
few milliseconds.  For example, in an 8 drive array let's say each stripe
read is slowed 5ms by only one of the 7 drives due to rotational latency,
maybe acoustical management, or some other firmware hiccup in the drive.
This slows down the entire stripe read because we can't do parity
reconstruction until all chunks are in.  An 8x 2TB array with 512KB chunk
has 4 million stripes of 4MB each.  Reading 4M stripes, that extra 5ms
per stripe read costs us

(4,000,000 * 0.005)/3600 = 5.56 hours

Now consider that arrays typically have a few years on them before the
first drive failure.  During our rebuild it's likely that some drives
will take a few rotations to return a sector that's marginal.  So this
might slow down a stripe read by dozens of milliseconds, maybe a full
second.  If this happens to multiple drives many times throughout the
rebuild it will add even more elapsed time, possibly additional hours.

Reading stripes asynchronously or in parallel, which I assume we already
do to some degree, can mitigate these latencies to some extent.  But I
think in the overall picture, things of this nature are what is driving
parity rebuilds to dozens of hours for many people.  And as I stated
previously, when drives reach 10-20TB, this becomes far worse because
we're reading 2-10x as many stripes.  And the more drives per array, the
greater the odds of incurring latency during a stripe read.

With a mirror reconstruction we can stream the reads.  Though we can't
avoid all of the drive issues above, the total number of hiccups causing
latency will be at most 1/7th those of the 8-drive parity array case.

-- 
Stan
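Stan's arithmetic above is easy to reproduce. A short sketch (the stripe count and 5ms stall are his example figures; the mirror comparison at the end just extrapolates his "at most 1/7th the hiccups" estimate and is illustrative only):

```python
# Model of the rebuild-time penalty described above: an 8 x 2 TB array
# with 512 KiB chunks has ~4 million 4 MiB stripes, and one drive adding
# 5 ms of rotational latency stalls the whole stripe read.
stripes = 4_000_000
stall_s = 0.005                            # one missed revolution, roughly

parity_penalty_h = stripes * stall_s / 3600
mirror_penalty_h = parity_penalty_h / 7    # ~1/7th the hiccups, per Stan

print(f"{parity_penalty_h:.2f} {mirror_penalty_h:.2f}")  # 5.56 0.79
```

That is, the same hiccup rate that adds about five and a half hours to a parity rebuild would add well under an hour to a streamed mirror rebuild.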
[PATCH] block: submit_bio_wait() conversions
It was being open coded in a few places.

Signed-off-by: Kent Overstreet
Cc: Jens Axboe
Cc: Joern Engel
Cc: Prasad Joshi
Cc: Neil Brown
Cc: Chris Mason
---
 block/blk-flush.c          | 19 +--
 drivers/md/md.c            | 12 +---
 fs/btrfs/check-integrity.c | 32 +---
 fs/btrfs/check-integrity.h |  2 ++
 fs/btrfs/extent_io.c       | 12 +---
 fs/btrfs/scrub.c           | 33 -
 fs/hfsplus/wrapper.c       | 17 +
 fs/logfs/dev_bdev.c        | 13 +
 8 files changed, 24 insertions(+), 116 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 331e627..fb6f3c0 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -502,15 +502,6 @@ void blk_abort_flushes(struct request_queue *q)
 	}
 }
 
-static void bio_end_flush(struct bio *bio, int err)
-{
-	if (err)
-		clear_bit(BIO_UPTODATE, &bio->bi_flags);
-	if (bio->bi_private)
-		complete(bio->bi_private);
-	bio_put(bio);
-}
-
 /**
  * blkdev_issue_flush - queue a flush
  * @bdev:	blockdev to issue flush for
@@ -526,7 +517,6 @@ static void bio_end_flush(struct bio *bio, int err)
 int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 		sector_t *error_sector)
 {
-	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *q;
 	struct bio *bio;
 	int ret = 0;
@@ -548,13 +538,9 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 		return -ENXIO;
 
 	bio = bio_alloc(gfp_mask, 0);
-	bio->bi_end_io = bio_end_flush;
 	bio->bi_bdev = bdev;
-	bio->bi_private = &wait;
 
-	bio_get(bio);
-	submit_bio(WRITE_FLUSH, bio);
-	wait_for_completion_io(&wait);
+	ret = submit_bio_wait(WRITE_FLUSH, bio);
 
 	/*
 	 * The driver must store the error location in ->bi_sector, if
@@ -564,9 +550,6 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 	if (error_sector)
 		*error_sector = bio->bi_sector;
 
-	if (!bio_flagged(bio, BIO_UPTODATE))
-		ret = -EIO;
-
 	bio_put(bio);
 	return ret;
 }
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b6b7a28..8700de3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -776,16 +776,10 @@ void md_super_wait(struct mddev *mddev)
 	finish_wait(&mddev->sb_wait, &wq);
 }
 
-static void bi_complete(struct bio *bio, int error)
-{
-	complete((struct completion*)bio->bi_private);
-}
-
 int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
 		 struct page *page, int rw, bool metadata_op)
 {
 	struct bio *bio = bio_alloc_mddev(GFP_NOIO, 1, rdev->mddev);
-	struct completion event;
 	int ret;
 
 	rw |= REQ_SYNC;
@@ -801,11 +795,7 @@ int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
 	else
 		bio->bi_sector = sector + rdev->data_offset;
 	bio_add_page(bio, page, size, 0);
-	init_completion(&event);
-	bio->bi_private = &event;
-	bio->bi_end_io = bi_complete;
-	submit_bio(rw, bio);
-	wait_for_completion(&event);
+	submit_bio_wait(rw, bio);
 
 	ret = test_bit(BIO_UPTODATE, &bio->bi_flags);
 	bio_put(bio);
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index b50764b..131d828 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -333,7 +333,6 @@ static void btrfsic_release_block_ctx(struct btrfsic_block_data_ctx *block_ctx);
 static int btrfsic_read_block(struct btrfsic_state *state,
 			      struct btrfsic_block_data_ctx *block_ctx);
 static void btrfsic_dump_database(struct btrfsic_state *state);
-static void btrfsic_complete_bio_end_io(struct bio *bio, int err);
 static int btrfsic_test_for_metadata(struct btrfsic_state *state,
 				     char **datav, unsigned int num_pages);
 static void btrfsic_process_written_block(struct btrfsic_dev_state *dev_state,
@@ -1687,7 +1686,6 @@ static int btrfsic_read_block(struct btrfsic_state *state,
 	for (i = 0; i < num_pages;) {
 		struct bio *bio;
 		unsigned int j;
-		DECLARE_COMPLETION_ONSTACK(complete);
 
 		bio = btrfs_io_bio_alloc(GFP_NOFS, num_pages - i);
 		if (!bio) {
@@ -1698,8 +1696,6 @@ static int btrfsic_read_block(struct btrfsic_state *state,
 		}
 		bio->bi_bdev = block_ctx->dev->bdev;
 		bio->bi_sector = dev_bytenr >> 9;
-		bio->bi_end_io = btrfsic_complete_bio_end_io;
-		bio->bi_private = &complete;
 
 		for (j = i; j < num_pages; j++) {
 			ret = bio_add_page(bio, block_ctx->pagev[j],
@@ -1712,12 +1708,7 @@ static int btrfsic_read_block(struct btrfsic_state *state,
 			"btrfsic: erro
Re: Triple parity and beyond
Hi Andrea,

On Sat, Nov 23, 2013 at 08:55:08AM +0100, Andrea Mazzoleni wrote:
> Hi Piergiorgio,
>
> > How about par2? How does this work?
> I checked the matrix they use, and sometimes it contains some singular
> square submatrix.
> It seems that in GF(2^16) these cases are just less common.  Maybe they
> were just unnoticed.
>
> Anyway, this seems to be an already known problem for PAR2, with a
> hypothetical PAR3 fixing it:
>
> http://sourceforge.net/p/parchive/discussion/96282/thread/d3c6597b/

You did some pretty damn good research work!  Maybe you should consider
contacting them too.  I'm not sure whether your approach can be extended
to GF(2^16); I guess it can, and in that case they might be interested
too.

bye,
-- 
piergiorgio
Re: Triple parity and beyond
On 22/11/13 23:59, NeilBrown wrote:
> On Fri, 22 Nov 2013 10:07:09 -0600 Stan Hoeppner wrote:
>> In the event of a double drive failure in one mirror, the RAID 1 code
>> will need to be modified in such a way as to allow the RAID 5 code to
>> rebuild the first replacement disk, because the RAID 1 device is still
>> in a failed state.  Once this rebuild is complete, the RAID 1 code will
>> need to switch the state to degraded, and then do its standard rebuild
>> routine for the 2nd replacement drive.  Or, with some (likely major)
>> hacking it should be possible to rebuild both drives simultaneously for
>> no loss of throughput or additional elapsed time on the RAID 5 rebuild.
>
> Nah, that would be minor hacking.  Just recreate the RAID1 in a state
> that is not-insync, but with automatic-resync disabled.  Then as
> continuous writes arrive, move the "recovery_cp" variable forward
> towards the end of the array.  When it reaches the end we can safely
> mark the whole array as 'in-sync' and forget about disabling
> auto-resync.
>
> NeilBrown

Those were my thoughts here too.  I don't know what state the planned
"bitmap of non-sync regions" feature is in, but if and when it is
implemented, you would just create the replacement raid1 pair without any
synchronisation.  Any writes to the pair (such as during a raid5 rebuild)
would get written to both disks at the same time.

David
Re: Nagios probe for btrfs RAID status?
Daniel Pocock posted on Sat, 23 Nov 2013 12:44:25 +0100 as excerpted:

>> [btrfs manpage quote]
>>
>> btrfs device stats [-z] {<path>|<device>}
>>
>> Read and print the device IO stats for all devices of the filesystem
>> identified by <path> or for a single <device>.
>>
>> -z   Reset stats to zero after reading them.
>>
>> Here's the output for my (dual device btrfs raid1) rootfs, here:
>>
>> btrfs dev stat /
>> [/dev/sdc5].write_io_errs   0
>> [/dev/sdc5].read_io_errs    0
>> [/dev/sdc5].flush_io_errs   0
>> [/dev/sdc5].corruption_errs 0
>> [/dev/sdc5].generation_errs 0
>> [/dev/sda5].write_io_errs   0
>> [/dev/sda5].read_io_errs    0
>> [/dev/sda5].flush_io_errs   0
>> [/dev/sda5].corruption_errs 0
>> [/dev/sda5].generation_errs 0
>>
>> As you can see, for multi-device filesystems it gives the stats per
>> component device.  Any errors accumulate until a reset using -z, so you
>> can easily see if the numbers are increasing over time and by how much.

> That looks interesting - are these explained anywhere?

I'd guess in the sources...  There's nothing more in the manpage about
them, and nothing on the wiki.

Some weeks ago I scanned some of the whitepapers listed on the wiki, and
found most of them frustratingly "big picture" vague on such details as
well.  =:^(  There was one that had a bit of detail, but only about half
of what I was looking for at the time (the difference between leafsize,
sectorsize and nodesize, three option knobs available on the mkfs.btrfs
commandline, what they actually tune, and, while I was at it, how they
relate to btrfs chunks) was there, and even then it wasn't explained very
clearly.

So it seems a lot of the documentation is sources-only at this point.
=:^(

> Should a Nagios plugin just look for any non-zero value or just focus
> on some of those?

I could guess at what some of them are and their significance based on
what I've seen here, but I'm afraid my guesses wouldn't rate well in SNR
terms, so I'll abstain...

> Are they runtime stats (since system boot) or are they maintained in
> the filesystem on disk?

The records are maintained across mounts/boots, so they must be stored
on-disk.  Only the -z switch zeroes them.

> My own version of the btrfs utility doesn't have that command though, I
> am using a Debian stable system.  I tried a newer version and it gives
>
> ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS)
>
> so I probably need to update my kernel too.

You've likely read it before, but btrfs remains a filesystem under heavy
development, with every kernel bringing fixes for known bugs, userspace
tools developed in tandem, and every btrfs user at this point by
definition a development filesystem tester.

While there are reasons one may wish to be conservative and stick with a
known stable system, they really tend to be antithetical to the reasons
one would have for testing something as development-edge as btrfs at this
point.

Thus, upgrading to a current kernel (3.12.x at this point, if not the
3.13 development kernel, as rc1 just came out) and btrfs-progs (at least;
you can keep the rest of the system stable Debian if you like) is very
strongly recommended if you're testing btrfs, in any case.

(For btrfs-progs, development happens in git branches, with merges to the
master branch only when changes are considered release-ready.  So current
git-master btrfs-progs is always the reference.  FWIW, here's what
btrfs --version outputs here, with btrfs-progs from git updated as of
yesterday, tho I usually keep within a week or two:
Btrfs v0.20-rc1-598-g8116550.)

See the btrfs wiki for more: https://btrfs.wiki.kernel.org.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: Nagios probe for btrfs RAID status?
On 23/11/13 11:35, Duncan wrote:
> Daniel Pocock posted on Sat, 23 Nov 2013 09:37:50 +0100 as excerpted:
>
>> What about when btrfs detects a bad block checksum and recovers data
>> from the equivalent block on another disk?  The wiki says there will
>> be a syslog event.  Does btrfs keep any stats on the number of blocks
>> that it considers unreliable and can this be queried from user space?
>
> The way you phrased that question is strange to me (considers
> unreliable?  does that mean ones that it had to fix, or ones that it
> had to fix more than once, or...), so I'm not sure this answers it, but
> from the btrfs manpage...

Let me clarify: when I said unreliable, I was referring to those blocks
where the block device driver reads the block without reporting any
error but where btrfs has decided the checksum is bad and not used the
data from the block.

Such blocks definitely exist.  Sometimes the data was corrupted at the
moment of writing, and no matter how many times you read the block, you
always get a bad checksum.

> btrfs device stats [-z] {<path>|<device>}
>
> Read and print the device IO stats for all devices of the filesystem
> identified by <path> or for a single <device>.
>
> Options
>
> -z   Reset stats to zero after reading them.
>
> Here's the output for my (dual device btrfs raid1) rootfs, here:
>
> btrfs dev stat /
> [/dev/sdc5].write_io_errs   0
> [/dev/sdc5].read_io_errs    0
> [/dev/sdc5].flush_io_errs   0
> [/dev/sdc5].corruption_errs 0
> [/dev/sdc5].generation_errs 0
> [/dev/sda5].write_io_errs   0
> [/dev/sda5].read_io_errs    0
> [/dev/sda5].flush_io_errs   0
> [/dev/sda5].corruption_errs 0
> [/dev/sda5].generation_errs 0
>
> As you can see, for multi-device filesystems it gives the stats per
> component device.  Any errors accumulate until a reset using -z, so you
> can easily see if the numbers are increasing over time and by how much.

That looks interesting - are these explained anywhere?

Should a Nagios plugin just look for any non-zero value or just focus on
some of those?

Are they runtime stats (since system boot) or are they maintained in the
filesystem on disk?

My own version of the btrfs utility doesn't have that command though; I
am using a Debian stable system.  I tried a newer version and it gives

ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS)

so I probably need to update my kernel too.
Re: [PATCH 0/5] Add support for object properties
Hi David,

On 2013-11-23 01:52, David Sterba wrote:
> On Tue, Nov 12, 2013 at 01:41:41PM +, Filipe David Borba Manana wrote:
>> This is a revised version of the original proposal/work from Alexander
>> Block to introduce a generic framework to set properties on btrfs
>> filesystem objects (inodes, subvolumes, filesystems, devices).
>>
>> Currently the command group looks like this:
>>
>>   btrfs prop set [-t ] /path/to/object
>>   btrfs prop get [-t ] /path/to/object []  (omitting name dumps all)
>>   btrfs prop list [-t ] /path/to/object    (lists properties with description)
>>
>> The type is used to explicitly specify what type of object you mean.
>> This is necessary in case the object+property combination is
>> ambiguous.  For example '/path/to/fs/root' could mean the root
>> subvolume, the directory inode or the filesystem itself.  Normally,
>> btrfs-progs will try to detect the type automatically.
>
> The generic commandline UI still looks ok to me, storing properties as
> xattrs is also ok (provided that we can capture the "btrfs." xattr
> namespace).
>
> I'll dump my thoughts and questions about the rest.
>
> 1) Where are properties stored that are not directly attached to an
> inode?  Ie. whole-filesystem and device.  How can I access props for a
> subvolume that is not currently reachable in the directory tree?
>
> Fs or device props must be accessible at any time, eg. no matter which
> subvolume is currently mounted.  This should probably be a special case
> where the property can be queried from any inode but will be internally
> routed to eg. the toplevel subvolume that will store the respective
> xattrs.

I think that we should divide "how the properties are accessed" from
"where the properties are stored".

- Storing the *inode* properties in xattrs makes sense: it is a clean
interface, and the infrastructure is capable of holding all the
information.  It could also be discussed whether we need a
1 property <-> 1 xattr mapping, or one xattr holding several properties
(which could alleviate some performance problems).

- Storing the *subvolume* and *filesystem* properties in an xattr
attached to the subvolume or '/' inode could be done.  When this
subvolume or '/' inode is accessed (during the mount and/or inode
traversal), all the information could be read from the xattr and stored
in the btrfs_fs_info and btrfs_root structs (which are easily accessible
from the inode, avoiding performance issues).

- For the *device* properties, we could consider using other trees, like
the device tree, to store the information, but still use the xattr
interfaces to access them.

> 2) if a property's object can be ambiguous, how is that distinguished
> in the xattrs?
>
> We don't have a list of props yet, so I'm trying to use one that
> hopefully makes some sense.  The example here can be
> persistent-mount-options that are attached to a fs and a subvolume.
> The fs-wide props will apply when a different subvolume is explicitly
> mounted.
>
> Assuming that the xattrs are stored with the toplevel subvolume, the
> fs-wide and per-subvolume property must have a different xattr name
> (another option is to store fs-wide elsewhere).  So the question is
> whether we should encode the property object into the xattr name
> directly.  Eg.:
>
>   btrfs.fs.persistent_mount
>   btrfs.subvol.persistent_mount
>
> or if the fs-wide uses a reserved naming scheme that would appear as an
> xattr named
>
>   btrfs.persistent_mount
>
> but the value would differ if queried with '-t fs' or '-t subvolume'.
>
> 3) property precedence, interaction with mount options
>
> The precedence should follow the rule of least surprise, ie. if I set
> eg. the compression of a file to zlib, set the subvolume compression
> type to 'none' and have fs-wide mounted the filesystem with
> compress=lzo, I'm expecting that the file will use 'zlib'.
>
> The generic rule says that mount options (that have a corresponding
> property) take precedence.  There may be exceptions.

As a general rule I suggest the following priority list:

  fs props, subvol props, dir props, inode props, mount opts

It should be possible to "override" all options via a mount option (to
force some behaviour).

However I have some doubts:

1) In the case of nested subvolumes, should the inner subvolumes inherit
the properties of the outer subvolumes?  I think not, because the
hierarchy is not so obvious once a subvolume is mounted (or moved).

2) There are properties that are "inode" related, and others that are
not.  For example, does it make sense to have an inode setting for
compression, raid profile, datasum/cow... when the data is shared
between different inodes/subvolumes which could have different setups?
What should the "least surprise" behaviour be?  Or should we intend
these properties only as hints for new files/data/chunks?  Anyway, what
would a balance do in these cases?

I am inclined to thi
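The precedence rule being discussed could be sketched roughly as follows. This is a hypothetical illustration only: the level names and the dict-based store are mine, not a proposed btrfs-progs interface, and whether mount options should win by default is exactly the open question in the thread.

```python
# Most specific object wins (inode > dir > subvol > fs), except that a
# forced mount option, when present, overrides everything -- one reading
# of the "override via mount option" suggestion above.
LEVELS = ["inode", "dir", "subvol", "fs"]  # most specific first

def resolve(prop, settings, forced_mount_opts=None):
    """settings maps level name -> {prop: value}; first hit wins."""
    if forced_mount_opts and prop in forced_mount_opts:
        return forced_mount_opts[prop]
    for level in LEVELS:
        value = settings.get(level, {}).get(prop)
        if value is not None:
            return value
    return None

# David's least-surprise example: file zlib, subvolume none, fs-wide lzo.
settings = {"fs": {"compression": "lzo"},
            "subvol": {"compression": "none"},
            "inode": {"compression": "zlib"}}

print(resolve("compression", settings))  # zlib: the inode prop wins
print(resolve("compression", settings, {"compression": "lzo"}))  # lzo: forced
```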
Re: Nagios probe for btrfs RAID status?
Daniel Pocock posted on Sat, 23 Nov 2013 09:37:50 +0100 as excerpted:

> What about when btrfs detects a bad block checksum and recovers data
> from the equivalent block on another disk?  The wiki says there will be
> a syslog event.  Does btrfs keep any stats on the number of blocks that
> it considers unreliable and can this be queried from user space?

The way you phrased that question is strange to me (considers unreliable?
does that mean ones that it had to fix, or ones that it had to fix more
than once, or...), so I'm not sure this answers it, but from the btrfs
manpage...

btrfs device stats [-z] {<path>|<device>}

Read and print the device IO stats for all devices of the filesystem
identified by <path> or for a single <device>.

Options

-z   Reset stats to zero after reading them.

Here's the output for my (dual device btrfs raid1) rootfs, here:

btrfs dev stat /
[/dev/sdc5].write_io_errs   0
[/dev/sdc5].read_io_errs    0
[/dev/sdc5].flush_io_errs   0
[/dev/sdc5].corruption_errs 0
[/dev/sdc5].generation_errs 0
[/dev/sda5].write_io_errs   0
[/dev/sda5].read_io_errs    0
[/dev/sda5].flush_io_errs   0
[/dev/sda5].corruption_errs 0
[/dev/sda5].generation_errs 0

As you can see, for multi-device filesystems it gives the stats per
component device.  Any errors accumulate until a reset using -z, so you
can easily see if the numbers are increasing over time and by how much.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
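A Nagios-style check over that output could look something like the sketch below. It flags any non-zero counter; whether every non-zero counter should actually page someone (Daniel's question upthread) is a policy decision this sketch does not settle. The parsing works on the text format shown above; the function name is mine.

```python
# Minimal sketch of a check over `btrfs device stats` output: return
# Nagios-convention exit codes (0 = OK, 2 = CRITICAL) based on whether
# any per-device error counter is non-zero.
import re

def check_btrfs_stats(output):
    """Parse lines like '[/dev/sdc5].write_io_errs 0' and flag non-zero."""
    bad = []
    for line in output.strip().splitlines():
        m = re.match(r"\[(\S+)\]\.(\w+)\s+(\d+)", line.strip())
        if m and int(m.group(3)) > 0:
            bad.append(f"{m.group(1)} {m.group(2)}={m.group(3)}")
    if bad:
        return 2, "CRITICAL: " + ", ".join(bad)
    return 0, "OK: all btrfs device error counters are zero"

sample = """
[/dev/sdc5].write_io_errs   0
[/dev/sdc5].corruption_errs 3
"""
print(check_btrfs_stats(sample))  # (2, 'CRITICAL: /dev/sdc5 corruption_errs=3')
```

In a real plugin the sample text would of course come from running `btrfs dev stat <path>` via a subprocess.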
Re: Nagios probe for btrfs RAID status?
On 23/11/13 09:37, Daniel Pocock wrote:
>
> On 23/11/13 04:59, Anand Jain wrote:
>>
>>> For example, would the command
>>>
>>> btrfs filesystem show --all-devices
>>>
>>> give a non-zero error status or some other clue if any of the devices
>>> are at risk?
>>
>> No there isn't any good way as of now.  That's something to fix.
>
> Does it require kernel/driver code changes or should it be possible to
> implement in the user space utility?
>
> It would be useful for people testing the filesystem to know when they
> get into trouble so they can investigate more quickly (and before the
> point of no return).
>
>> [btrfs personal user/sysadmin, not a dev, not anything large enough to
>> have personal nagios experience...]
>>
>> AFAIK, btrfs raid modes currently switch the filesystem to read-only
>> on any device-drop error.  That has been deemed the simplest/safest
>> policy during development, tho at some point as stable approaches the
>> behavior could theoretically be made optional.
>
> None of the warnings about btrfs's experimental status hint at that;
> some people may be surprised by it.
>
>> So detection could watch for read-only and act accordingly, either
>> switching back to read-write or rebooting or simply logging the event,
>> as deemed appropriate.
>
> It would be relatively trivial to implement a Nagios check for
> read-only; Nagios probes are just shell scripts.

Just checked, it already exists, so we are half way there:

http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_ro_mounts/details

> What about when btrfs detects a bad block checksum and recovers data
> from the equivalent block on another disk?  The wiki says there will be
> a syslog event.  Does btrfs keep any stats on the number of blocks that
> it considers unreliable and can this be queried from user space?
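For completeness, the core of such a read-only check is small enough to sketch here: scan the mount table for btrfs filesystems that have gone read-only (as they do when a device drops, per the discussion above). A literal string stands in for /proc/mounts so the example is self-contained; the function name is mine.

```python
# Sketch of the read-only check discussed above: find btrfs filesystems
# mounted with the 'ro' option, in /proc/mounts format
# (device mountpoint fstype options dump pass).
def ro_btrfs_mounts(mounts_text):
    ro = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "btrfs":
            if "ro" in fields[3].split(","):
                ro.append(fields[1])   # mount point gone read-only
    return ro

sample = ("/dev/sda2 / btrfs rw,ssd 0 0\n"
          "/dev/sdb1 /data btrfs ro,degraded 0 0")
print(ro_btrfs_mounts(sample))  # ['/data']
```

A real probe would read `/proc/mounts` and map a non-empty result to the Nagios CRITICAL exit code, much as check_ro_mounts does.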
Re: Nagios probe for btrfs RAID status?
On 23/11/13 04:59, Anand Jain wrote:
>
>> For example, would the command
>>
>> btrfs filesystem show --all-devices
>>
>> give a non-zero error status or some other clue if any of the devices
>> are at risk?
>
> No there isn't any good way as of now.  That's something to fix.

Does it require kernel/driver code changes or should it be possible to
implement in the user space utility?

It would be useful for people testing the filesystem to know when they
get into trouble so they can investigate more quickly (and before the
point of no return).

> [btrfs personal user/sysadmin, not a dev, not anything large enough to
> have personal nagios experience...]
>
> AFAIK, btrfs raid modes currently switch the filesystem to read-only on
> any device-drop error.  That has been deemed the simplest/safest policy
> during development, tho at some point as stable approaches the behavior
> could theoretically be made optional.

None of the warnings about btrfs's experimental status hint at that;
some people may be surprised by it.

> So detection could watch for read-only and act accordingly, either
> switching back to read-write or rebooting or simply logging the event,
> as deemed appropriate.

It would be relatively trivial to implement a Nagios check for
read-only; Nagios probes are just shell scripts.

What about when btrfs detects a bad block checksum and recovers data
from the equivalent block on another disk?  The wiki says there will be
a syslog event.  Does btrfs keep any stats on the number of blocks that
it considers unreliable and can this be queried from user space?