Re: Triple parity and beyond

2013-11-23 Thread Russell Coker
On Sun, 24 Nov 2013, Stan Hoeppner  wrote:
> I have always surmised that the culprit is rotational latency, because
> we're not able to get a real sector-by-sector streaming read from each
> drive.  If even only one disk in the array has to wait for the platter
> to come round again, the entire stripe read is slowed down by an
> additional few milliseconds.  For example, in an 8 drive array let's say
> each stripe read is slowed 5ms by only one of the 7 drives due to
> rotational latency, maybe acoustical management, or some other firmware
> hiccup in the drive.  This slows down the entire stripe read because we
> can't do parity reconstruction until all chunks are in.  An 8x 2TB array
> with 512KB chunk has 4 million stripes of 4MB each.  Reading 4M stripes,
> that extra 5ms per stripe read costs us
> 
> (4,000,000 * 0.005)/3600 = 5.56 hours

If that is the problem then the solution would be to just enable read-ahead.  
Don't we already have that in both the OS and the disk hardware?  The hard-
drive read-ahead buffer should at least cover the case where a seek completes 
but the desired sector isn't under the heads.

RAM size is steadily increasing; it seems that the smallest you can get 
nowadays is 1G in a phone, and for a server the smallest is probably 4G.

On the smallest system that might have an 8 disk array you should be able to 
use 512M for buffers, which allows a read-ahead of 128 chunks per drive 
(i.e. 128 full 4MB stripes).
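
For what it's worth, the device read-ahead is easy to experiment with.  A 
minimal sketch, assuming an md array at /dev/md0 (the device name and the 
512MB figure are just the example numbers above), remembering that blockdev 
counts in 512-byte sectors:

  blockdev --getra /dev/md0          # show the current read-ahead window
  blockdev --setra 1048576 /dev/md0  # 1048576 sectors * 512B = 512MB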

> Now consider that arrays typically have a few years on them before the
> first drive failure.  During our rebuild it's likely that some drives
> will take a few rotations to return a sector that's marginal.

Are you suggesting that it would be a common case that people just write data 
to an array and never read it or do an array scrub?  I hope that it will 
become standard practice to have a cron job scrubbing all filesystems.
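
For btrfs that could be as simple as a weekly crontab entry along these 
lines (a sketch only; the binary path, mount point and schedule are 
placeholders):

  # min hour dom mon dow  command
  0 3 * * 0  /sbin/btrfs scrub start -Bq /mountpoint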

> So  this
> might slow down a stripe read by dozens of milliseconds, maybe a full
> second.  If this happens to multiple drives many times throughout the
> rebuild it will add even more elapsed time, possibly additional hours.

Have you observed such 1 second reads in practice?

One thing I've considered doing is placing a cheap disk on a speaker cone to 
test vibration-induced performance problems.  Then I can use a PC to control 
the level of vibration in a reasonably repeatable manner.  I'd like to see 
what the limits are for retries.

Some years ago a company I worked for had some vibration problems which 
dropped the contiguous read speed from about 100MB/s to about 40MB/s on some 
parts of the disk (other parts gave full performance).  That was a serious and 
unusual problem, and it only about halved the overall speed.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-23 Thread John Williams
On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner  wrote:

> Parity array rebuilds are read-modify-write operations.  The main
> difference from normal operation RMWs is that the write is always to the
> same disk.  As long as the stripe reads and chunk reconstruction outrun
> the write throughput then the rebuild speed should be as fast as a
> mirror rebuild.  But this doesn't appear to be what people are
> experiencing.  Parity rebuilds would seem to take much longer.

"This" doesn't appear to be what SOME people, who have reported
issues, are experiencing. Their issues must be examined on a case by
case basis.

But I, and a number of other people I have talked to or corresponded
with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
approximately the optimal sequential write speed of the replacement
drive. It is not unusual on a reasonably configured system.

I don't know how fast the rebuilds go on the experimental RAID 5 or
RAID 6 for btrfs.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


WARNING: CPU: 7 PID: 1239 at fs/btrfs/inode.c:4721 inode_tree_add+0xc2/0x13f [btrfs]()

2013-11-23 Thread Tomasz Chmielewski
I'm getting these with 3.13-rc1:

[53358.655620] [ cut here ]
[53358.655686] WARNING: CPU: 7 PID: 1239 at fs/btrfs/inode.c:4721 
inode_tree_add+0xc2/0x13f [btrfs]()
[53358.655779] Modules linked in: veth ipt_MASQUERADE iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables 
x_tables cpufreq_ondemand cpufreq_conservative cpufreq_powersave cpufreq_stats 
bridge stp llc ipv6 btrfs xor raid6_pq zlib_deflate loop i2c_i801 i2c_core 
lpc_ich ehci_pci ehci_hcd mfd_core button video acpi_cpufreq pcspkr ext4 crc16 
jbd2 mbcache raid1 sg sd_mod ahci libahci libata scsi_mod r8169 mii
[53358.656094] CPU: 7 PID: 1239 Comm: btrfs-endio-wri Not tainted 3.13.0-rc1 #1
[53358.656142] Hardware name: System manufacturer System Product Name/P8H77-M 
PRO, BIOS 1101 02/04/2013
[53358.656232]  0009 8803f1fedb18 81389e7d 
0006
[53358.656322]   8803f1fedb58 810370a9 
88053edb7e00
[53358.656411]  a02715f7 8804434aa580 8804434a3230 
8807edb37800
[53358.656500] Call Trace:
[53358.656546]  [] dump_stack+0x46/0x58
[53358.656595]  [] warn_slowpath_common+0x77/0x91
[53358.656648]  [] ? inode_tree_add+0xc2/0x13f [btrfs]
[53358.656697]  [] warn_slowpath_null+0x15/0x17
[53358.656749]  [] inode_tree_add+0xc2/0x13f [btrfs]
[53358.656802]  [] btrfs_iget+0x46c/0x4b6 [btrfs]
[53358.656851]  [] ? igrab+0x40/0x41
[53358.656902]  [] relink_extent_backref+0x105/0x6cf [btrfs]
[53358.656955]  [] btrfs_finish_ordered_io+0x7bd/0x877 [btrfs]
[53358.657008]  [] finish_ordered_fn+0x10/0x12 [btrfs]
[53358.657063]  [] worker_loop+0x15e/0x495 [btrfs]
[53358.657115]  [] ? btrfs_queue_worker+0x269/0x269 [btrfs]
[53358.657165]  [] kthread+0xcd/0xd5
[53358.657211]  [] ? kthread_freezable_should_stop+0x43/0x43
[53358.657260]  [] ret_from_fork+0x7c/0xb0
[53358.657307]  [] ? kthread_freezable_should_stop+0x43/0x43
[53358.657355] ---[ end trace bf9f7dd59e43977f ]---


-- 
Tomasz Chmielewski
http://wpkg.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] block: submit_bio_wait() conversions

2013-11-23 Thread NeilBrown
On Sat, 23 Nov 2013 20:03:30 -0800 Kent Overstreet  wrote:

> It was being open coded in a few places.
> 
> Signed-off-by: Kent Overstreet 
> Cc: Jens Axboe 
> Cc: Joern Engel 
> Cc: Prasad Joshi 
> Cc: Neil Brown 
> Cc: Chris Mason 

Acked-by: NeilBrown 

for the drivers/md/md.c bits, however...

> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index b6b7a28..8700de3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -776,16 +776,10 @@ void md_super_wait(struct mddev *mddev)
>   finish_wait(&mddev->sb_wait, &wq);
>  }
>  
> -static void bi_complete(struct bio *bio, int error)
> -{
> - complete((struct completion*)bio->bi_private);
> -}
> -
>  int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
>struct page *page, int rw, bool metadata_op)
>  {
>   struct bio *bio = bio_alloc_mddev(GFP_NOIO, 1, rdev->mddev);
> - struct completion event;
>   int ret;
>  
>   rw |= REQ_SYNC;
^^^

you could remove this line as well, as submit_bio_wait sets this flag for us.


> @@ -801,11 +795,7 @@ int sync_page_io(struct md_rdev *rdev, sector_t sector, 
> int size,
>   else
>   bio->bi_sector = sector + rdev->data_offset;
>   bio_add_page(bio, page, size, 0);
> - init_completion(&event);
> - bio->bi_private = &event;
> - bio->bi_end_io = bi_complete;
> - submit_bio(rw, bio);
> - wait_for_completion(&event);
> + submit_bio_wait(rw, bio);
>  
>   ret = test_bit(BIO_UPTODATE, &bio->bi_flags);
>   bio_put(bio);

Thanks,
NeilBrown




Re: Triple parity and beyond

2013-11-23 Thread Stan Hoeppner
On 11/23/2013 1:12 AM, NeilBrown wrote:
> On Fri, 22 Nov 2013 21:34:41 -0800 John Williams  wrote:

>> Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
>>
>> Bottom line is that IO bandwidth is not a problem for a system with
>> prudently chosen hardware.

Quite right.

>> More likely is that you would be CPU limited (rather than bus limited)
>> in a high-parity rebuild where more than one drive failed. But even
>> that is not likely to be too bad, since Andrea's single-threaded
>> recovery code can recover two drives at nearly 1GB/s on one of my
>> machines. I think the code could probably be threaded to achieve a
>> multiple of that running on multiple cores.
> 
> Indeed.  It seems likely that with modern hardware, the  linear write speed
> would be the limiting factor for spinning-rust drives.

Parity array rebuilds are read-modify-write operations.  The main
difference from normal operation RMWs is that the write is always to the
same disk.  As long as the stripe reads and chunk reconstruction outrun
the write throughput then the rebuild speed should be as fast as a
mirror rebuild.  But this doesn't appear to be what people are
experiencing.  Parity rebuilds would seem to take much longer.

I have always surmised that the culprit is rotational latency, because
we're not able to get a real sector-by-sector streaming read from each
drive.  If even only one disk in the array has to wait for the platter
to come round again, the entire stripe read is slowed down by an
additional few milliseconds.  For example, in an 8 drive array let's say
each stripe read is slowed 5ms by only one of the 7 drives due to
rotational latency, maybe acoustical management, or some other firmware
hiccup in the drive.  This slows down the entire stripe read because we
can't do parity reconstruction until all chunks are in.  An 8x 2TB array
with 512KB chunk has 4 million stripes of 4MB each.  Reading 4M stripes,
that extra 5ms per stripe read costs us

(4,000,000 * 0.005)/3600 = 5.56 hours
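
To replay that arithmetic with different assumptions (the stripe count and 
the per-stripe penalty above are only illustrative), a one-liner does it:

  $ awk 'BEGIN { stripes = 4000000; delay = 0.005; printf "%.2f hours\n", stripes * delay / 3600 }'
  5.56 hours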

Now consider that arrays typically have a few years on them before the
first drive failure.  During our rebuild it's likely that some drives
will take a few rotations to return a sector that's marginal.  So  this
might slow down a stripe read by dozens of milliseconds, maybe a full
second.  If this happens to multiple drives many times throughout the
rebuild it will add even more elapsed time, possibly additional hours.

Reading stripes asynchronously or in parallel, which I assume we already
do to some degree, can mitigate these latencies to some extent.  But I
think in the overall picture, things of this nature are what is driving
parity rebuilds to dozens of hours for many people.  And as I stated
previously, when drives reach 10-20TB, this becomes far worse because
we're reading 2-10x as many stripes.  And the more drives per array the
greater the odds of incurring latency during a stripe read.

With a mirror reconstruction we can stream the reads.  Though we can't
avoid all of the drive issues above, the total number of hiccups causing
latency will be at most 1/7th those of the parity 8 drive array case.

-- 
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] block: submit_bio_wait() conversions

2013-11-23 Thread Kent Overstreet
It was being open coded in a few places.

Signed-off-by: Kent Overstreet 
Cc: Jens Axboe 
Cc: Joern Engel 
Cc: Prasad Joshi 
Cc: Neil Brown 
Cc: Chris Mason 
---
 block/blk-flush.c  | 19 +--
 drivers/md/md.c| 12 +---
 fs/btrfs/check-integrity.c | 32 +---
 fs/btrfs/check-integrity.h |  2 ++
 fs/btrfs/extent_io.c   | 12 +---
 fs/btrfs/scrub.c   | 33 -
 fs/hfsplus/wrapper.c   | 17 +
 fs/logfs/dev_bdev.c| 13 +
 8 files changed, 24 insertions(+), 116 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 331e627..fb6f3c0 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -502,15 +502,6 @@ void blk_abort_flushes(struct request_queue *q)
}
 }
 
-static void bio_end_flush(struct bio *bio, int err)
-{
-   if (err)
-   clear_bit(BIO_UPTODATE, &bio->bi_flags);
-   if (bio->bi_private)
-   complete(bio->bi_private);
-   bio_put(bio);
-}
-
 /**
  * blkdev_issue_flush - queue a flush
  * @bdev:  blockdev to issue flush for
@@ -526,7 +517,6 @@ static void bio_end_flush(struct bio *bio, int err)
 int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
sector_t *error_sector)
 {
-   DECLARE_COMPLETION_ONSTACK(wait);
struct request_queue *q;
struct bio *bio;
int ret = 0;
@@ -548,13 +538,9 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t 
gfp_mask,
return -ENXIO;
 
bio = bio_alloc(gfp_mask, 0);
-   bio->bi_end_io = bio_end_flush;
bio->bi_bdev = bdev;
-   bio->bi_private = &wait;
 
-   bio_get(bio);
-   submit_bio(WRITE_FLUSH, bio);
-   wait_for_completion_io(&wait);
+   ret = submit_bio_wait(WRITE_FLUSH, bio);
 
/*
 * The driver must store the error location in ->bi_sector, if
@@ -564,9 +550,6 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t 
gfp_mask,
if (error_sector)
*error_sector = bio->bi_sector;
 
-   if (!bio_flagged(bio, BIO_UPTODATE))
-   ret = -EIO;
-
bio_put(bio);
return ret;
 }
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b6b7a28..8700de3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -776,16 +776,10 @@ void md_super_wait(struct mddev *mddev)
finish_wait(&mddev->sb_wait, &wq);
 }
 
-static void bi_complete(struct bio *bio, int error)
-{
-   complete((struct completion*)bio->bi_private);
-}
-
 int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
 struct page *page, int rw, bool metadata_op)
 {
struct bio *bio = bio_alloc_mddev(GFP_NOIO, 1, rdev->mddev);
-   struct completion event;
int ret;
 
rw |= REQ_SYNC;
@@ -801,11 +795,7 @@ int sync_page_io(struct md_rdev *rdev, sector_t sector, 
int size,
else
bio->bi_sector = sector + rdev->data_offset;
bio_add_page(bio, page, size, 0);
-   init_completion(&event);
-   bio->bi_private = &event;
-   bio->bi_end_io = bi_complete;
-   submit_bio(rw, bio);
-   wait_for_completion(&event);
+   submit_bio_wait(rw, bio);
 
ret = test_bit(BIO_UPTODATE, &bio->bi_flags);
bio_put(bio);
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index b50764b..131d828 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -333,7 +333,6 @@ static void btrfsic_release_block_ctx(struct 
btrfsic_block_data_ctx *block_ctx);
 static int btrfsic_read_block(struct btrfsic_state *state,
  struct btrfsic_block_data_ctx *block_ctx);
 static void btrfsic_dump_database(struct btrfsic_state *state);
-static void btrfsic_complete_bio_end_io(struct bio *bio, int err);
 static int btrfsic_test_for_metadata(struct btrfsic_state *state,
 char **datav, unsigned int num_pages);
 static void btrfsic_process_written_block(struct btrfsic_dev_state *dev_state,
@@ -1687,7 +1686,6 @@ static int btrfsic_read_block(struct btrfsic_state *state,
for (i = 0; i < num_pages;) {
struct bio *bio;
unsigned int j;
-   DECLARE_COMPLETION_ONSTACK(complete);
 
bio = btrfs_io_bio_alloc(GFP_NOFS, num_pages - i);
if (!bio) {
@@ -1698,8 +1696,6 @@ static int btrfsic_read_block(struct btrfsic_state *state,
}
bio->bi_bdev = block_ctx->dev->bdev;
bio->bi_sector = dev_bytenr >> 9;
-   bio->bi_end_io = btrfsic_complete_bio_end_io;
-   bio->bi_private = &complete;
 
for (j = i; j < num_pages; j++) {
ret = bio_add_page(bio, block_ctx->pagev[j],
@@ -1712,12 +1708,7 @@ static int btrfsic_read_block(struct btrfsic_state 
*state,
   "btrfsic: erro

Re: Triple parity and beyond

2013-11-23 Thread Piergiorgio Sartor
Hi Andrea,

On Sat, Nov 23, 2013 at 08:55:08AM +0100, Andrea Mazzoleni wrote:
> Hi Piergiorgio,
> 
> > How about par2? How does this work?
> I checked the matrix they use, and sometimes it contains some singular
> square submatrix.
> It seems that in GF(2^16) these cases are just less common. Maybe they
> were just unnoticed.
> 
> Anyway, this seems to be an already known problem for PAR2, with an
> hypothetical PAR3 fixing it:
> 
> http://sourceforge.net/p/parchive/discussion/96282/thread/d3c6597b/

you did some pretty damn good research work!

Maybe you should consider contacting them too.
I'm not sure whether your approach can be extended
to GF(2^16); I guess it can, and in that case they
might be interested too.

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-23 Thread David Brown

On 22/11/13 23:59, NeilBrown wrote:

On Fri, 22 Nov 2013 10:07:09 -0600 Stan Hoeppner 
wrote:




In the event of a double drive failure in one mirror, the RAID 1 code
will need to be modified in such a way as to allow the RAID 5 code to
rebuild the first replacement disk, because the RAID 1 device is still
in a failed state.  Once this rebuild is complete, the RAID 1 code will
need to switch the state to degraded, and then do its standard rebuild
routine for the 2nd replacement drive.

Or, with some (likely major) hacking it should be possible to rebuild
both drives simultaneously for no loss of throughput or additional
elapsed time on the RAID 5 rebuild.


Nah, that would be minor hacking.  Just recreate the RAID1 in a state that is
not-insync, but with automatic-resync disabled.
Then as continuous writes arrive, move the "recovery_cp" variable forward
towards the end of the array.  When it reaches the end we can safely mark the
whole array as 'in-sync' and forget about disabling auto-resync.

NeilBrown



That was my thinking here.  I don't know what state the planned "bitmap 
of non-sync regions" feature is in, but if and when it is implemented, 
you would just create the replacement raid1 pair without any 
synchronisation.  Any writes to the pair (such as during a raid5 
rebuild) would get written to both disks at the same time.


David


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nagios probe for btrfs RAID status?

2013-11-23 Thread Duncan
Daniel Pocock posted on Sat, 23 Nov 2013 12:44:25 +0100 as excerpted:

>> [btrfs manpage quote]
>> btrfs device stats [-z] {<path>|<device>}
>> 
>> Read and print the device IO stats for all devices of the filesystem
>> identified by <path> or for a single <device>.

>> -z   Reset stats to zero after reading them.

>> Here's the output for my (dual device btrfs raid1) rootfs, here:
>> 
>> btrfs dev stat /
>> [/dev/sdc5].write_io_errs   0
>> [/dev/sdc5].read_io_errs    0
>> [/dev/sdc5].flush_io_errs   0
>> [/dev/sdc5].corruption_errs 0
>> [/dev/sdc5].generation_errs 0
>> [/dev/sda5].write_io_errs   0
>> [/dev/sda5].read_io_errs    0
>> [/dev/sda5].flush_io_errs   0
>> [/dev/sda5].corruption_errs 0
>> [/dev/sda5].generation_errs 0
>> 
>> As you can see, for multi-device filesystems it gives the stats per
>> component device.  Any errors accumulate until a reset using -z, so you
>> can easily see if the numbers are increasing over time and by how much.

> That looks interesting - are these explained anywhere?

I'd guess in the sources...  There's nothing more in the manpage about 
them, and nothing on the wiki.  Some weeks ago I scanned some of the 
whitepapers listed on the wiki, and found most of them frustratingly "big 
picture" vague on such details as well. =:^(  There was one that had a 
bit of detail, but it covered only about half of what I was looking for 
at the time (the difference between leafsize, sectorsize and nodesize, 
three option knobs available on the mkfs.btrfs commandline, what they 
actually tuned, and, while I was at it, how they related to btrfs 
chunks), and even then it wasn't really explained very clearly.  So it 
seems a lot of the documentation is sources-only at this point. =:^(

> Should a Nagios plugin just look for any non-zero value or just focus on
> some of those?

I could guess at what some of them are and their significance based on 
what I've seen here, but I'm afraid my guesses wouldn't rate well in SNR 
terms, so I'll abstain...

> Are they runtime stats (since system boot) or are they maintained in the
> filesystem on disk?

The records are maintained across mounts/boots so must be stored on-
disk.  Only the -z switch zeroes.

> My own version of the btrfs utility doesn't have that command though, I
> am using a Debian stable system.  I tried a newer version and it gives
> 
> ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS)
> 
> so I probably need to update my kernel too.

You've likely read it before, but btrfs remains a filesystem under heavy 
development, with every kernel bringing fixes for known bugs and 
userspace tools developed in tandem, and every btrfs user at this point 
is by definition a development filesystem tester.  While there are 
reasons one may wish to be conservative and stick with a known stable 
system, they really tend to be antithetical to the reasons one would 
have for testing something as development-edge as btrfs at this point.  
Thus, upgrading to a current kernel (3.12.x at this point, if not 3.13 
development kernel as rc1 just came out) and btrfs-progs (at least, you 
can keep the rest of the system stable Debian if you like) is very 
strongly recommended if you're testing btrfs, in any case.

(For btrfs-progs, development happens in git branches, with merges to the 
master branch only when changes are considered release-ready.  So current 
git-master btrfs-progs is always the reference.  FWIW, here's what btrfs 
--version outputs here, btrfs-progs from git updated as of yesterday as 
it happens, tho I usually keep within a week or two: Btrfs v0.20-rc1-598-
g8116550.)

See the btrfs wiki for more:  https://btrfs.wiki.kernel.org.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nagios probe for btrfs RAID status?

2013-11-23 Thread Daniel Pocock


On 23/11/13 11:35, Duncan wrote:
> Daniel Pocock posted on Sat, 23 Nov 2013 09:37:50 +0100 as excerpted:
> 
>> What about when btrfs detects a bad block checksum and recovers data
>> from the equivalent block on another disk?  The wiki says there will be
>> a syslog event.  Does btrfs keep any stats on the number of blocks that
>> it considers unreliable and can this be queried from user space?
> 
> The way you phrased that question is strange to me (considers unreliable?
> does that mean ones that it had to fix, or ones that it had to fix more 
> than once, or...), so I'm not sure this answers it, but from the btrfs 
> manpage...


Let me clarify: when I said unreliable, I was referring to those blocks
where the block device driver reads the block without reporting any
error but where btrfs has decided the checksum is bad and not used the
data from the block.

Such blocks definitely exist. Sometimes the data was corrupted at the
moment of writing and no matter how many times you read the block, you
always get a bad checksum.


>
> 
> btrfs device stats [-z] {<path>|<device>}
> 
> Read and print the device IO stats for all devices of the filesystem 
> identified by <path> or for a single <device>.
> 
> Options
> 
> -z   Reset stats to zero after reading them.
> 
> 
> 
> Here's the output for my (dual device btrfs raid1) rootfs, here:
> 
> btrfs dev stat /
> [/dev/sdc5].write_io_errs   0
> [/dev/sdc5].read_io_errs    0
> [/dev/sdc5].flush_io_errs   0
> [/dev/sdc5].corruption_errs 0
> [/dev/sdc5].generation_errs 0
> [/dev/sda5].write_io_errs   0
> [/dev/sda5].read_io_errs    0
> [/dev/sda5].flush_io_errs   0
> [/dev/sda5].corruption_errs 0
> [/dev/sda5].generation_errs 0
> 
> As you can see, for multi-device filesystems it gives the stats per 
> component device.  Any errors accumulate until a reset using -z, so you 
> can easily see if the numbers are increasing over time and by how much.
> 


That looks interesting - are these explained anywhere?

Should a Nagios plugin just look for any non-zero value or just focus on
some of those?

Are they runtime stats (since system boot) or are they maintained in the
filesystem on disk?

My own version of the btrfs utility doesn't have that command though, I
am using a Debian stable system.  I tried a newer version and it gives

ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS)

so I probably need to update my kernel too.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] Add support for object properties

2013-11-23 Thread Goffredo Baroncelli
Hi David,

On 2013-11-23 01:52, David Sterba wrote:
> On Tue, Nov 12, 2013 at 01:41:41PM +, Filipe David Borba Manana wrote:
>> This is a revised version of the original proposal/work from Alexander Block
>> to introduce a generic framework to set properties on btrfs filesystem 
>> objects
>> (inodes, subvolumes, filesystems, devices).
> 
>> Currently the command group looks like this:
>> btrfs prop set [-t ] /path/to/object  
>> btrfs prop get [-t ] /path/to/object [] (omitting name 
>> dumps all)
>> btrfs prop list [-t ] /path/to/object (lists properties with 
>> description)
>>
>> The type is used to explicitly specify what type of object you mean. 
>> This is 
>> necessary in case the object+property combination is ambiguous.  For 
>> example
>> '/path/to/fs/root' could mean the root subvolume, the directory 
>> inode or the 
>> filesystem itself. Normally, btrfs-progs will try to detect the type 
>> automatically.
> 
> The generic commandline UI still looks ok to me, storing properties as xattr’s
> is also ok (provided that we can capture the “btrfs.” xattr namespace).
> 
> I’ll dump my thoughts and questions about the rest.
> 
> 1) Where are stored properties that are not directly attached to an inode? Ie.
>whole-filesystem and device. How can I access props for a subvolume that is
>not currently reachable in the directory tree?
> 
> A fs or device props must be accessible any time, eg. no matter which 
> subvolume
> is currently mounted. This should be probably a special case where the 
> property
> can be queried from any inode but will be internally routed to eg. toplevel
> subvolume that will store the respective xattrs.


I think that we should separate "how the properties are accessed" from
"where the properties are stored".

- Storing the *inode* properties in xattrs makes sense: it is a clean
interface, and the infrastructure is capable of holding all the
information. It could also be discussed whether we need a 1 property <->
1 xattr mapping, or one xattr holding several properties (the latter
could alleviate some performance problems).

- Storing the *subvolume* and *filesystem* properties in an xattr
associated with the subvolume or '/' inode could be done. When that
subvolume or '/' inode is accessed (during mount and/or inode traversal),
all the information could be read from the xattr and cached in the
btrfs_fs_info and btrfs_root structs (which are easily reachable from the
inode, avoiding performance issues).

- For the *device* properties, we could consider using other trees, like
the device tree, to store the information, while still using the xattr
interface to access them (a minimal sketch of the xattr view follows
below).
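
As a concrete illustration of the xattr view, something along these lines
(a sketch only; 'btrfs.compression' and the paths are placeholder names,
since the actual property names were not settled in this thread):

  setfattr -n btrfs.compression -v zlib /mnt/subvol/file
  getfattr -d -m '^btrfs\.' /mnt/subvol/file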

> 
> 
> 2) if a property’s object can be ambiguous, how is that distinguished in the
>xattrs?
> 
> We don’t have a list of props yet, so I’m trying to use one that hopefully
> makes some sense. The example here can be persistent-mount-options that are
> attached to fs and a subvolume. The fs-wide props will apply when a different
> subvolume is explicitly mounted.
> 
> Assuming that the xattrs are stored with the toplevel subvolume, the fs-wide
> and per-subvolume property must have a differnt xattr name (another option is
> to store fs-wide elsewhere). So the question is, if we should encode the
> property object into the xattr name directly. Eg.:
> 
>   btrfs.fs.persistent_mount
>   btrfs.subvol.persistent_mount
> 
> or if the fs-wide uses a reserved naming scheme that would appear as xattr
> named
> 
>   btrfs.persistent_mount
> 
> but the value would differ if queried with ‘-t fs’ or ‘-t subvolume’.
> 
> 
> 3) property precedence, interaction with mount options
> 
> The precedence should follow the rule of the least surprise, ie. if I set eg. 
> a
> compression of a file to zlib, set the subvolume compression type to ‘none’ 
> and
> have fs-wide mount the filesystem with compress=lzo, I’m expecting that the
> file will use ‘zlib’.

> 
> The generic rule says that mount options (that have a corresponding property)
> take precedence. There may be exceptions.


As a general rule I suggest the following priority list (in increasing
order of precedence):

fs props, subvol props, dir props, inode props, mount opts

It should be possible to "override" all options via a mount option (to
force some behaviour).

However I have some doubts about:

1) In the case of nested subvolumes, should the inner subvolumes inherit
the properties of the outer subvolumes? I think not, because the
hierarchy is not so obvious when a subvolume is mounted (or moved).

2) Some properties are "inode" related, others are not. For example, does
it make sense to have per-inode settings for compression, raid profile,
datasum/cow... when the data is shared between different
inodes/subvolumes which could have different setups? What should the
"least surprise" behaviour be? Or should we intend these properties only
as hints for new files/data/chunks? Anyway, what would a balance do in
these cases?
I am inclined to thi

Re: Nagios probe for btrfs RAID status?

2013-11-23 Thread Duncan
Daniel Pocock posted on Sat, 23 Nov 2013 09:37:50 +0100 as excerpted:

> What about when btrfs detects a bad block checksum and recovers data
> from the equivalent block on another disk?  The wiki says there will be
> a syslog event.  Does btrfs keep any stats on the number of blocks that
> it considers unreliable and can this be queried from user space?

The way you phrased that question is strange to me (considers unreliable?
does that mean ones that it had to fix, or ones that it had to fix more 
than once, or...), so I'm not sure this answers it, but from the btrfs 
manpage...



btrfs device stats [-z] {<path>|<device>}

Read and print the device IO stats for all devices of the filesystem 
identified by <path> or for a single <device>.

Options

-z   Reset stats to zero after reading them.



Here's the output for my (dual device btrfs raid1) rootfs, here:

btrfs dev stat /
[/dev/sdc5].write_io_errs   0
[/dev/sdc5].read_io_errs    0
[/dev/sdc5].flush_io_errs   0
[/dev/sdc5].corruption_errs 0
[/dev/sdc5].generation_errs 0
[/dev/sda5].write_io_errs   0
[/dev/sda5].read_io_errs    0
[/dev/sda5].flush_io_errs   0
[/dev/sda5].corruption_errs 0
[/dev/sda5].generation_errs 0

As you can see, for multi-device filesystems it gives the stats per 
component device.  Any errors accumulate until a reset using -z, so you 
can easily see if the numbers are increasing over time and by how much.
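
A minimal Nagios-style probe built on that output could simply flag any 
non-zero counter (a sketch only; the mount point is a placeholder, and 
whether some counters deserve WARNING rather than CRITICAL is a policy 
choice):

  #!/bin/sh
  # Exit 0 (OK) if every counter is zero, 2 (CRITICAL) otherwise.
  errors=$(btrfs device stats /mountpoint | awk '$2 != 0')
  if [ -n "$errors" ]; then
      echo "CRITICAL: btrfs device errors: $errors"
      exit 2
  fi
  echo "OK: no btrfs device errors"
  exit 0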



-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nagios probe for btrfs RAID status?

2013-11-23 Thread Daniel Pocock


On 23/11/13 09:37, Daniel Pocock wrote:
> 
> 
> On 23/11/13 04:59, Anand Jain wrote:
>>
>>
>>> For example, would the command
>>>
>>>  btrfs filesystem show --all-devices
>>>
>>> give a non-zero error status or some other clue if any of the devices
>>> are at risk?
>>
>>  No, there isn't any good way as of now. That's something to fix.
> 
> Does it require kernel/driver code changes, or should it be possible to
> implement in the user-space utility?
> 
> It would be useful for people testing the filesystem to know when they
> get into trouble so they can investigate more quickly (and before the
> point of no return)
> 
>> [btrfs personal user/sysadmin, not a dev, not anything large enough to
>> have personal nagios experience...]
>>
>> AFAIK, btrfs raid modes currently switch the filesystem to read-only on
>> any device-drop error. That has been deemed the simplest/safest policy
>> during development, tho at some point as stable approaches the behavior
>> could theoretically be made optional.
> 
> None of the warnings about btrfs's experimental status hint at that;
> some people may be surprised by it.
> 
>> So detection could watch for read-only and act accordingly, either
>> switching back to read-write or rebooting or simply logging the event,
>> as deemed appropriate.
> 
> It would be relatively trivial to implement a Nagios check for
> read-only; Nagios probes are just shell scripts.

Just checked, it already exists, so we are half way there:

http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_ro_mounts/details


> 
> What about when btrfs detects a bad block checksum and recovers data
> from the equivalent block on another disk?  The wiki says there will be
> a syslog event.  Does btrfs keep any stats on the number of blocks that
> it considers unreliable and can this be queried from user space?
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nagios probe for btrfs RAID status?

2013-11-23 Thread Daniel Pocock


On 23/11/13 04:59, Anand Jain wrote:
> 
> 
>> For example, would the command
>>
>>  btrfs filesystem show --all-devices
>>
>> give a non-zero error status or some other clue if any of the devices
>> are at risk?
> 
>  No, there isn't any good way as of now. That's something to fix.

Does it require kernel/driver code changes, or should it be possible to
implement in the user-space utility?

It would be useful for people testing the filesystem to know when they
get into trouble so they can investigate more quickly (and before the
point of no return)

> [btrfs personal user/sysadmin, not a dev, not anything large enough to
> have personal nagios experience...]
> 
> AFAIK, btrfs raid modes currently switch the filesystem to read-only on
> any device-drop error. That has been deemed the simplest/safest policy
> during development, tho at some point as stable approaches the behavior
> could theoretically be made optional.

None of the warnings about btrfs's experimental status hint at that;
some people may be surprised by it.

> So detection could watch for read-only and act accordingly, either
> switching back to read-write or rebooting or simply logging the event,
> as deemed appropriate.

It would be relatively trivial to implement a Nagios check for
read-only; Nagios probes are just shell scripts.
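
Something along these lines would do as a starting point (a sketch; it
just lists btrfs filesystems currently mounted read-only):

  awk '$3 == "btrfs" && $4 ~ /(^|,)ro(,|$)/ { print $2 }' /proc/mounts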

What about when btrfs detects a bad block checksum and recovers data
from the equivalent block on another disk?  The wiki says there will be
a syslog event.  Does btrfs keep any stats on the number of blocks that
it considers unreliable and can this be queried from user space?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html