Re: ENOSPC errors during balance
Marc Joliet posted on Tue, 22 Jul 2014 01:30:22 +0200 as excerpted: > And now that the background deletion of the old snapshots is done, the file > system ended up at: > > # btrfs filesystem df /run/media/marcec/MARCEC_BACKUP > Data, single: total=219.00GiB, used=140.13GiB > System, DUP: total=32.00MiB, used=36.00KiB > Metadata, DUP: total=4.50GiB, used=2.40GiB > unknown, single: total=512.00MiB, used=0.00 > > I don't know how reliable du is for this, but I used it to estimate how much > used data I should expect, and I get 138 GiB. That means that the snapshots > yield about 2 GiB "overhead", which is very reasonable, I think. Obviously > I'll be starting a full balance now. FWIW, the balance should reduce the data total quite a bit, to 141-ish GiB (might be 142 or 145, but it should definitely come down from 219 GiB), because the spread between total and used is relatively high, now, and balance is what's used to bring that back down. Metadata total will probably come down a bit as well, to 3.00 GiB or so. What's going on there is this: Btrfs allocates and deallocates data and metadata in two stages. First it allocates chunks, 1 GiB in size for data, 256 MiB in size for metadata, but because metadata is dup by default it allocates two chunks so half a GiB at a time, there. Then the actual file data and metadata can be written into the pre-allocated chunks, filling them up. As they near full, more chunks will be allocated from the unallocated pool as necessary. But on file deletion, btrfs only automatically handles the file data/metadata level; it doesn't (yet) automatically deallocate the chunks, nor can it change the allocation from say a data chunk to a metadata chunk. So when a chunk is allocated, it stays allocated. That's the spread you see in btrfs filesystem df, between total and used, for each chunk type. The way to recover those allocated but unused chunks to the unallocated pool, so they can be reallocated between data and metadata as necessary, is with a balance. That balance, therefore, should reduce the spread seen in the above between total and used. Meanwhile, btrfs filesystem df shows the spread between allocated and used for each type, but what about unallocated? Simple. Btrfs filesystem show lists total filesystem size as well as allocated usage for each device. (The total line is something else, I recommend ignoring it as it's simply confusing. Only pay attention to the individual device lines.) Thus, to get a proper picture of the space usage status on a btrfs filesystem, you must have both the btrfs filesystem show and btrfs filesystem df output for that filesystem, show to tell you how much of the total space is chunk-allocated for each device, df to tell you what those allocations are, and how much of the chunk-allocated space is actually used, for each allocation type. It's wise to keep track of the show output in particular, and when the spread between used (allocated) and total for each device gets low, under a few GiB, check btrfs fi df and see what's using that space unnecessarily and then do a balance to recover it, if possible. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
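[Editor's note: as a concrete illustration of the show + df + balance workflow described above, the sequence would look something like the sketch below. The mount point is the one from this thread; the usage thresholds are only example values, not a recommendation from the original post.]

# how much of each device is chunk-allocated
btrfs filesystem show /run/media/marcec/MARCEC_BACKUP

# what those allocations are, and how much of them is actually used
btrfs filesystem df /run/media/marcec/MARCEC_BACKUP

# reclaim mostly-empty chunks; -dusage/-musage only rewrite chunks that are
# less than the given percentage used, so this is cheaper than a full balance
btrfs balance start -dusage=50 -musage=50 /run/media/marcec/MARCEC_BACKUP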
Re: 1 week to rebuild 4x 3TB raid10 is a long time!
ronnie sahlberg posted on Mon, 21 Jul 2014 09:46:07 -0700 as excerpted:

> On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>>
>>> If you assume a 12ms average seek time (normal for 7200RPM SATA
>>> drives), an 8.3ms rotational latency (half a rotation), an average
>>> 64kb write and a 100MB/S streaming write speed, each write comes in
>>> at ~21ms, which gives us ~47 IOPS. With the 64KB write size, this
>>> comes out to ~3MB/S, DISK LIMITED.
>>
>>> The 5MB/S that TM is seeing is fine, considering the small files he
>>> says he has.
>>
> That is actually nonsense.
> Raid rebuild operates on the block/stripe layer and not on the
> filesystem layer.

If we were talking about a normal raid, yes. But we're talking about
btrFS, note the FS for filesystem, so indeed it *IS* the filesystem layer.
Now this particular "filesystem" /does/ happen to have raid properties as
well, but it's definitely filesystem level...

> It does not matter at all what the average file size is.

... and the file size /does/ matter.

> Raid rebuild is really only limited by disk i/o speed when performing a
> linear read of the whole spindle using huge i/o sizes,
> or, if you have multiple spindles on the same bus, the bus saturation
> speed.

Makes sense... if you're dealing at the raid level. If we were talking
about dmraid or mdraid... and they're both much more mature and optimized,
as well, so 50 MiB/sec, per spindle in parallel, would indeed be a
reasonable expectation for them.

But (barring bugs, which will and do happen at this stage of development)
btrfs both makes far better data validity guarantees, and does a lot more
complex processing what with COW and snapshotting, etc, of course in
addition to the normal filesystem level stuff AND the raid-level stuff it
does.

> Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
> when doing a raid rebuild.

... And perfectly reasonable, at least at this point, to expect ~5 MiB/sec
total throughput, one spindle at a time, for btrfs.

> That is for the naive rebuild that rebuilds every single stripe. A
> smarter rebuild that knows which stripes are unused can skip the unused
> stripes and thus become even faster than that.
>
> Now, that the rebuild is off by an order of magnitude is by design but
> should be fixed at some stage, but with the current state of btrfs it is
> probably better to focus on other more urgent areas first.

Because of all the extra work it does, btrfs may never get to full
streaming speed across all spindles at once. But it can and will certainly
get much better than it is, once the focus moves to optimization.

*AND*, because it /does/ know which areas of the device are actually in
use, once btrfs is optimized, it's quite likely that despite the slower raw
speed, because it won't have to deal with the unused area, at least with
the typically 20-60% unused filesystems most people run, rebuild times will
match or be faster than raid-layer-only technologies that must rebuild the
entire device, because they do /not/ know which areas are unused.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
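[Editor's note: one way to sanity-check the "~5 MiB/s total, one spindle at a time" observation is to watch per-device throughput while a rebuild or balance runs. A minimal sketch; device names are placeholders and iostat comes from the sysstat package.]

# in one shell, start the operation being measured, e.g.:
btrfs balance start /mnt

# in another shell, watch per-device throughput in MB/s, refreshed every 5 seconds
iostat -dmx 5 /dev/sdb /dev/sdc /dev/sdd /dev/sde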
Re: 1 week to rebuild 4x 3TB raid10 is a long time!
On 07/21/2014 10:00 PM, TM wrote:
> Wang Shilong cn.fujitsu.com> writes:
>> Just my two cents:
>>
>> Since 'btrfs replace' supports RAID10, I suppose using the replace
>> operation is better than 'device removal and add'.
>>
>> Another Question is related to btrfs snapshot-aware balance.
>> How many snapshots did you have in your system?
>>
>> Of course, during balance/resize/device removal operations, you could
>> still snapshot, but fewer snapshots should speed things up!
>>
>> Anyway 'btrfs replace' is implemented more efficiently than
>> 'device removal and add'.
>
> Hi Wang, just one subvolume, no snapshots or anything else.
>
> device replace: to tell you the truth I have not used it in the past.
> Most of my testing was done 2 years ago. So in this 'kind of production'
> system I did not try it. But if I knew that it was faster, perhaps I
> could have used it. Does anyone have statistics for such a replace and
> the time it takes?

I don't have specific statistics about this. The conclusion comes from the
implementation differences between replace and 'device removal'.

> Also, can replace be used when one device is missing? Can't find
> documentation.
> eg. btrfs replace start missing /dev/sdXX

The latest btrfs-progs includes a man page for btrfs-replace.
Actually, you could use it something like:

btrfs replace start |

You could use 'btrfs file show' to see the missing device id, and then run
btrfs replace.

Thanks,
Wang

> TM
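[Editor's note: to make the suggestion above concrete, a replace of a missing device might look roughly like the following sketch. The device id "3" and the target device are only examples; check the 'btrfs filesystem show' output for the real missing devid.]

# find the devid reported as missing
btrfs filesystem show /mnt

# replace that devid with a new device, then watch progress
btrfs replace start 3 /dev/sdXX /mnt
btrfs replace status /mnt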
Re: ENOSPC errors during balance
Am Tue, 22 Jul 2014 00:30:57 +0200 schrieb Marc Joliet : > Am Mon, 21 Jul 2014 15:22:16 +0200 > schrieb Marc Joliet : > > > Am Sun, 20 Jul 2014 21:44:40 +0200 > > schrieb Marc Joliet : > > > > [...] > > > What I did: > > > > > > - delete the single largest file on the file system, a 12 GB VM image, > > > along > > > with all subvolumes that contained it > > > - rsync it over again > > [...] > > > > I want to point out at this point, though, that doing those two steps freed > > a > > disproportionate amount of space. The image file is only 12 GB, and it > > hadn't > > changed in any of the snapshots (I haven't used this VM since June), so that > > "subvolume delete -c " returned after a few seconds. Yet > > deleting it > > seems to have freed up twice as much. You can see this from the "filesystem > > df" > > output: before, "used" was at 229.04 GiB, and after deleting it and copying > > it > > back (and after a day's worth of backups) went down to 218 GiB. > > > > Does anyone have any idea how this happened? > > > > Actually, now I remember something that is probably related: when I first > > moved to my current backup scheme last week, I first copied the data from > > the > > last rsnapshot based backup with "cp --reflink" to the new backup location, > > but > > forgot to use "-a". I interrupted it and ran "cp -a -u --reflink", but it > > had > > already copied a lot, and I was too impatient to start over; after all, the > > data hadn't changed. Then, when rsync (with --inplace) ran for the first > > time, > > all of these files with wrong permissions and different time stamps were > > copied > > over, but for some reason, the space used increased *greatly*; *much* more > > than > > I would expect from changed metadata. > > > > The total size of the file system data should be around 142 GB (+ > > snapshots), > > but, well, it's more than 1.5 times as much. > > > > Perhaps cp --reflink treats hard links differently than expected? I would > > have > > expected the data pointed to by the hard link to have been referenced, but > > maybe something else happened? > > Hah, OK, apparently when my daily backup removed the oldest daily snapshot, it > freed up whatever was taking up so much space, so as of now the file system > uses only 169.14 GiB (from 218). Weird. And now that the background deletion of the old snapshots is done, the file system ended up at: # btrfs filesystem df /run/media/marcec/MARCEC_BACKUP Data, single: total=219.00GiB, used=140.13GiB System, DUP: total=32.00MiB, used=36.00KiB Metadata, DUP: total=4.50GiB, used=2.40GiB unknown, single: total=512.00MiB, used=0.00 I don't know how reliable du is for this, but I used it to estimate how much used data I should expect, and I get 138 GiB. That means that the snapshots yield about 2 GiB "overhead", which is very reasonable, I think. Obviously I'll be starting a full balance now. I still think this whole... thing is very odd, hopefully somebody can shed some light on it for me (maybe it's obvious, but I don't see it). -- Marc Joliet -- "People who think they know everything really annoy those of us who know we don't" - Bjarne Stroustrup signature.asc Description: PGP signature
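[Editor's note: for reference, the full balance mentioned above can be monitored from a second shell; a minimal sketch using the same mount point as in the thread.]

btrfs balance start /run/media/marcec/MARCEC_BACKUP

# from another shell, check how far it has gotten
btrfs balance status -v /run/media/marcec/MARCEC_BACKUP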
Re: ENOSPC errors during balance
Am Mon, 21 Jul 2014 15:22:16 +0200 schrieb Marc Joliet : > Am Sun, 20 Jul 2014 21:44:40 +0200 > schrieb Marc Joliet : > > [...] > > What I did: > > > > - delete the single largest file on the file system, a 12 GB VM image, along > > with all subvolumes that contained it > > - rsync it over again > [...] > > I want to point out at this point, though, that doing those two steps freed a > disproportionate amount of space. The image file is only 12 GB, and it hadn't > changed in any of the snapshots (I haven't used this VM since June), so that > "subvolume delete -c " returned after a few seconds. Yet deleting > it > seems to have freed up twice as much. You can see this from the "filesystem > df" > output: before, "used" was at 229.04 GiB, and after deleting it and copying it > back (and after a day's worth of backups) went down to 218 GiB. > > Does anyone have any idea how this happened? > > Actually, now I remember something that is probably related: when I first > moved to my current backup scheme last week, I first copied the data from the > last rsnapshot based backup with "cp --reflink" to the new backup location, > but > forgot to use "-a". I interrupted it and ran "cp -a -u --reflink", but it had > already copied a lot, and I was too impatient to start over; after all, the > data hadn't changed. Then, when rsync (with --inplace) ran for the first > time, > all of these files with wrong permissions and different time stamps were > copied > over, but for some reason, the space used increased *greatly*; *much* more > than > I would expect from changed metadata. > > The total size of the file system data should be around 142 GB (+ snapshots), > but, well, it's more than 1.5 times as much. > > Perhaps cp --reflink treats hard links differently than expected? I would > have > expected the data pointed to by the hard link to have been referenced, but > maybe something else happened? Hah, OK, apparently when my daily backup removed the oldest daily snapshot, it freed up whatever was taking up so much space, so as of now the file system uses only 169.14 GiB (from 218). Weird. -- Marc Joliet -- "People who think they know everything really annoy those of us who know we don't" - Bjarne Stroustrup signature.asc Description: PGP signature
Testing with flaky disk
List, btrfs developers.

I started working on a test tool for SCSI initiators and filesystem folks.
It is an iSCSI target that implements a bad/flaky disk where you can set
precise controls of how/what is broken, which you can use to test error and
recovery paths in the initiator/filesystem.

The tool is available at:
https://github.com/rsahlberg/flaky-stgt.git
and is a modified version of the TGTD iscsi target.

Right now it is just an initial prototype and it needs more work to add
more types of errors as well as making it more user-friendly. But it is
still useful enough to illustrate certain failure cases which could be
helpful to btrfs and others. Let me illustrate.

Let's start by creating a BTRFS filesystem spanning three 1G disks:

#
# Create three disks and export them through flaky iSCSI
#
truncate -s 1G /data/tmp/disk1.img
truncate -s 1G /data/tmp/disk2.img
truncate -s 1G /data/tmp/disk3.img
killall -9 tgtd
./usr/tgtd -f -d 1 &
sleep 3
./usr/tgtadm --op new --mode target --tid 1 -T iqn.ronnie.test
./usr/tgtadm --op new --mode logicalunit --tid 1 --lun 1 -b /data/tmp/disk1.img --blocksize=4096
./usr/tgtadm --op new --mode logicalunit --tid 1 --lun 2 -b /data/tmp/disk2.img --blocksize=4096
./usr/tgtadm --op new --mode logicalunit --tid 1 --lun 3 -b /data/tmp/disk3.img --blocksize=4096
./usr/tgtadm --op bind --mode target --tid 1 -I ALL

#
# connect to the three disks
#
iscsiadm --mode discoverydb --type sendtargets --portal 127.0.0.1 --discover
iscsiadm --mode node --targetname iqn.ronnie.test --portal 127.0.0.1:3260 --login

#
# check dmesg, you should now have three new 1G disks
#
# Use: iscsiadm --mode node --targetname iqn.ronnie.test \
#      --portal 127.0.0.1:3260 --logout
# to disconnect the disks when you are finished.

# create a btrfs filesystem
mkfs.btrfs -f -d raid1 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-1 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-2 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-3

# mount the filesystem
mount /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-1 /mnt

Then we can proceed to copy a bunch of data to the filesystem so that there
will be some blocks used.

Now we can see how/what happens in the case of a single bad disk. Let's say
the disk has gone bad: it is still possible to read from the disk, but all
writes fail with a medium error. Perhaps this is similar to the case of a
cheap disk that has completely run out of blocks to reallocate to?

===
# make all writes to the third disk fail with write error.
# 3      - MEDIUM ERROR
# 0x0c02 - WRITE ERROR AUTOREALLOCATION FAILED
#
./usr/tgtadm --mode error --op new --tid 1 --lun 3 --error op=WRITE10,lba=0,len=,pct=100,pause=0,repeat=0,action=CHECK_CONDITION,key=3,asc=0x0c02

# To show all current error injects:
# ./usr/tgtadm --mode error --op show
#
# To delete/clear all current error injects:
# ./usr/tgtadm --mode error --op delete
===

If you now know that this disk has gone bad, you could try to delete the
device:

btrfs device delete /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-3 /mnt

but this will probably not work, since at least up to semi-recent versions
of btrfs you cannot remove a device from the filesystem UNLESS you can also
write to the device. Thus making it impossible to remove the bad device in
any way other than physically removing it.
This is suboptimal from a data integrity point of view, since if the disk
is readable it can potentially still contain valid copies of data that
might have silent errors on the other mirror.

At some stage, from a data integrity and data robustness standpoint, it
would be nice to be able to 'device delete' a device that is readable, and
contains a valid copy of the data, but is still unwriteable.

There is a bunch of other things you can test and emulate with this too.

I have only tested this with semi-recent versions of btrfs and not the
latest version. I will wait until the current versions of btrfs become more
stable/robust before I start experimenting with it.

Since I think this could be invaluable for a filesystem developer, please
have a look. I am more than happy to add additional features that would
make it even more useful for filesystem error-path-and-recovery testing.

regards
ronnie sahlberg
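[Editor's note: for contrast, when a device is truly absent (rather than present but unwritable, as in the scenario above), current btrfs does offer a path out. A rough sketch, with device paths only as examples:]

# mount the remaining devices read-write in degraded mode
mount -o degraded /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-1 /mnt

# then drop the absent device from the filesystem
btrfs device delete missing /mnt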
Re: 1 week to rebuild 4x 3TB raid10 is a long time!
On Jul 21, 2014, at 10:46 AM, ronnie sahlberg wrote:

> On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>>
>>> If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
>>> an 8.3ms rotational latency (half a rotation), an average 64kb write and
>>> a 100MB/S streaming write speed, each write comes in at ~21ms, which
>>> gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S,
>>> DISK LIMITED.
>>
>>> The 5MB/S that TM is seeing is fine, considering the small files he says
>>> he has.
>>
>> Thanks for the additional numbers supporting my point. =:^)
>>
>> I had run some of the numbers but not to the extent you just did, so I
>> didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
>> range of expectation for spinning rust, given the current state of
>> optimization... or more accurately the lack thereof, due to the focus
>> still being on features.
>
> That is actually nonsense.
> Raid rebuild operates on the block/stripe layer and not on the filesystem
> layer.

Not on Btrfs. It is on the filesystem layer. However, a rebuild is about
replicating metadata (up to 256MB) and data (up to 1GB) chunks. For raid10,
those are further broken down into 64KB strips. So the smallest "unit" for
replication during a rebuild on Btrfs would be 64KB.

Anyway, 5MB/s seems really low to me, so I'm suspicious something else is
going on. I haven't done a rebuild in a couple months, but my recollection
is it's always been as fast as the write performance of a single device in
the btrfs volume.

I'd be looking in dmesg for any of the physical drives being reset, or
having read or write errors, and I'd do some individual drive testing to
see if the problem can be isolated. And if that's not helpful, well, it's
really tedious and produces verbose amounts of information, but something
that might reveal the issue is to capture the actual commands going to the
physical devices:
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg34886.html

My expectation (i.e. I'm guessing) based on previous testing is that
whether raid1 or raid10, the actual read/write commands will each be 256KB
in size. Btrfs rebuild is basically designed to be a sequential operation.
This could maybe fall apart if there were somehow many minimally full
chunks, which is probably unlikely.

Chris Murphy
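[Editor's note: a few of the quick per-drive checks suggested above could look like this; drive names are placeholders and smartctl comes from smartmontools.]

# look for resets, link errors or I/O errors on the physical drives
dmesg | grep -iE 'ata[0-9]|reset|i/o error'

# per-drive health, error counters and self-test logs
smartctl -a /dev/sdb
smartctl -a /dev/sdc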
Re: `btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags & 1)' failed.` in `btrfsck`
Hi, I could `btrfsck --repair` the sparse file with Linux 3.15.6-utopic from http://kernel.ubuntu.com/~kernel-ppa/mainline/ and btrfsck 3.12-1 (from btrfs-tools package in Ubuntu 14.04). Thanks for your hints, Wang! All the best, Karl Am 18.07.2014 14:13, schrieb Wang Shilong: > > Hi, > > There are some patches for fsck flighting, they are integrated in David's > branches. > You can pull from David's latest branch, and see if it helps: > > https://github.com/kdave/btrfs-progs integration-20140704 > > Have a try and see if it helps anyway. > > Thanks, > Wang > >> Hi together, >> I'm experiencing the following issues when I invoke `btrfsck` on a >> sparse file image with a GPT and one (the only) btrfs partition attached >> to a loop device >> >>$ sudo btrfsck --repair --init-csum-tree --init-extent-tree -b >> /dev/loop0p1 >>Incorrect local backref count on 128510738432 root 5 owner 3849475 >> offset 0 found 1 wanted 0 back 0xbab41270 >>backpointer mismatch on [128510738432 4096] >>ref mismatch on [128510742528 12288] extent item 0, found 1 >>btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags >> & 1)' failed. >> >>$ sudo btrfsck --repair --init-csum-tree --init-extent-tree /dev/loop0p1 >>Incorrect local backref count on 128510726144 root 5 owner 3849470 >> offset 0 found 1 wanted 0 back 0xbbcb9500 >>backpointer mismatch on [128510726144 12288] >>ref mismatch on [128510738432 4096] extent item 0, found 1 >>adding new data backref on 128510738432 root 5 owner 3849475 offset >> 0 found 1 >>Backref 128510738432 root 5 owner 3849475 offset 0 num_refs 0 not >> found in extent tree >>Incorrect local backref count on 128510738432 root 5 owner 3849475 >> offset 0 found 1 wanted 0 back 0xbbcb9630 >>backpointer mismatch on [128510738432 4096] >>ref mismatch on [128510742528 12288] extent item 0, found 1 >>btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags >> & 1)' failed. >> >>$ sudo btrfsck --repair /dev/loop0p1 >>Incorrect local backref count on 130861096960 root 5 owner 22733727 >> offset 0 found 1 wanted 0 back 0xc7c7d170 >>backpointer mismatch on [130861096960 8192] >>ref mismatch on [130861105152 8192] extent item 0, found 1 >>btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags >> & 1)' failed. >> >>$ sudo btrfsck --repair /dev/loop0p1 >>Backref 130861096960 root 5 owner 22733727 offset 0 num_refs 0 not >> found in extent tree >>Incorrect local backref count on 130861096960 root 5 owner 22733727 >> offset 0 found 1 wanted 0 back 0xc7f31170 >>backpointer mismatch on [130861096960 8192] >>ref mismatch on [130861105152 8192] extent item 0, found 1 >>btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags >> & 1)' failed. >> >> I'm using `btrfs-progs` 24cf4d8c3ee924b474f68514e0167cc2e602a48d on >> Linux 3.16-rc5 (anything else, i.e. older versions, give me immediate >> error after start because errornous file system) >> >> I'd like to know whether this (assertion) error is related to a bug or >> missing feature in btrfs-progs and might be fixed at some point or >> whether this might indicate a completely messed up btrfs. >> >> Best regards, >> Karl-P. Richter >> > signature.asc Description: OpenPGP digital signature
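[Editor's note: for anyone wanting to try the integration branch Wang points to, a rough sketch of building and running it, assuming the usual btrfs-progs build dependencies are installed; the loop device is the one from this report.]

git clone https://github.com/kdave/btrfs-progs.git
cd btrfs-progs
git checkout integration-20140704
make

# run the freshly built fsck against the affected partition
sudo ./btrfsck --repair /dev/loop0p1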
Re: 1 week to rebuild 4x 3TB raid10 is a long time!
On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>
>> If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
>> an 8.3ms rotational latency (half a rotation), an average 64kb write and
>> a 100MB/S streaming write speed, each write comes in at ~21ms, which
>> gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S,
>> DISK LIMITED.
>
>> The 5MB/S that TM is seeing is fine, considering the small files he says
>> he has.
>
> Thanks for the additional numbers supporting my point. =:^)
>
> I had run some of the numbers but not to the extent you just did, so I
> didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
> range of expectation for spinning rust, given the current state of
> optimization... or more accurately the lack thereof, due to the focus
> still being on features.

That is actually nonsense.

Raid rebuild operates on the block/stripe layer and not on the filesystem
layer. It does not matter at all what the average file size is.

Raid rebuild is really only limited by disk i/o speed when performing a
linear read of the whole spindle using huge i/o sizes, or, if you have
multiple spindles on the same bus, the bus saturation speed.

Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
when doing a raid rebuild.

That is for the naive rebuild that rebuilds every single stripe. A smarter
rebuild that knows which stripes are unused can skip the unused stripes and
thus become even faster than that.

Now, that the rebuild is off by an order of magnitude is by design, but it
should be fixed at some stage; with the current state of btrfs it is
probably better to focus on other, more urgent areas first.
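[Editor's note: to put the "~50 MByte/second per spindle" figure in context, the raw linear-read ceiling of a single drive can be measured directly. A quick sketch; the device name is just an example.]

# rough sequential-read ceiling for one spindle
hdparm -t /dev/sdb

# or, read 2 GiB with direct I/O to bypass the page cache
dd if=/dev/sdb of=/dev/null bs=1M count=2048 iflag=direct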
Q: BTRFS_IOC_DEFRAG_RANGE and START_IO
I am working on readahead in systemd and am trying to complete a todo item
for it. One of the todos is:

readahead: use BTRFS_IOC_DEFRAG_RANGE instead of the BTRFS_IOC_DEFRAG
ioctl, with START_IO

Can someone explain what the start_io flag in BTRFS_IOC_DEFRAG_RANGE does?
Does it just force the data to be written out after defragmentation, or
does it do something else? Does this flag mean that btrfs can guarantee
data consistency after defragmentation?

Thanks for any explanation!

-- 
Best regards,
Timofey.
Re: 1 week to rebuild 4x 3TB raid10 is a long time!
Wang Shilong cn.fujitsu.com> writes:

> Just my two cents:
>
> Since 'btrfs replace' supports RAID10, I suppose using the replace
> operation is better than 'device removal and add'.
>
> Another Question is related to btrfs snapshot-aware balance.
> How many snapshots did you have in your system?
>
> Of course, during balance/resize/device removal operations, you could
> still snapshot, but fewer snapshots should speed things up!
>
> Anyway 'btrfs replace' is implemented more efficiently than
> 'device removal and add'.

Hi Wang, just one subvolume, no snapshots or anything else.

device replace: to tell you the truth I have not used it in the past. Most
of my testing was done 2 years ago. So in this 'kind of production' system
I did not try it. But if I knew that it was faster, perhaps I could have
used it. Does anyone have statistics for such a replace and the time it
takes?

Also, can replace be used when one device is missing? Can't find
documentation.
eg. btrfs replace start missing /dev/sdXX

TM
Re: ENOSPC errors during balance
Am Sun, 20 Jul 2014 21:44:40 +0200 schrieb Marc Joliet : [...] > What I did: > > - delete the single largest file on the file system, a 12 GB VM image, along > with all subvolumes that contained it > - rsync it over again [...] I want to point out at this point, though, that doing those two steps freed a disproportionate amount of space. The image file is only 12 GB, and it hadn't changed in any of the snapshots (I haven't used this VM since June), so that "subvolume delete -c " returned after a few seconds. Yet deleting it seems to have freed up twice as much. You can see this from the "filesystem df" output: before, "used" was at 229.04 GiB, and after deleting it and copying it back (and after a day's worth of backups) went down to 218 GiB. Does anyone have any idea how this happened? Actually, now I remember something that is probably related: when I first moved to my current backup scheme last week, I first copied the data from the last rsnapshot based backup with "cp --reflink" to the new backup location, but forgot to use "-a". I interrupted it and ran "cp -a -u --reflink", but it had already copied a lot, and I was too impatient to start over; after all, the data hadn't changed. Then, when rsync (with --inplace) ran for the first time, all of these files with wrong permissions and different time stamps were copied over, but for some reason, the space used increased *greatly*; *much* more than I would expect from changed metadata. The total size of the file system data should be around 142 GB (+ snapshots), but, well, it's more than 1.5 times as much. Perhaps cp --reflink treats hard links differently than expected? I would have expected the data pointed to by the hard link to have been referenced, but maybe something else happened? -- Marc Joliet -- "People who think they know everything really annoy those of us who know we don't" - Bjarne Stroustrup signature.asc Description: PGP signature
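[Editor's note: for reference, the corrected reflink copy described above would normally look like the sketch below; whether "used" really stays flat can be checked with filesystem df before and after. Paths are placeholders.]

# reflink-aware copy that also preserves ownership, permissions and timestamps
cp -a --reflink=always /backup/rsnapshot/daily.0/ /backup/current/

# compare "used" before and after; a pure reflink copy should barely change it
btrfs filesystem df /backup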
Re: ENOSPC errors during balance
On 20/07/14 14:59, Duncan wrote: Marc Joliet posted on Sun, 20 Jul 2014 12:22:33 +0200 as excerpted: On the other hand, the wiki [0] says that defragmentation (and balancing) is optional, and the only reason stated for doing either is because they "will have impact on performance". Yes. That's what threw off the other guy as well. He decided to skip it for the same reason. If I had a wiki account I'd change it, but for whatever reason I tend to be far more comfortable writing list replies, sometimes repeatedly, than writing anything on the web, which I tend to treat as read-only. So I've never gotten a wiki account and thus haven't changed it, and apparently the other guy with the problem and anyone else that knows hasn't changed it either, so the conversion page still continues to underemphasize the importance of completing the conversion steps, including the defrag, in proper order. I've inserted information specific to this in the wiki. Others with wiki accounts, feel free to review: https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3#Before_first_use -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
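[Editor's note: for readers landing here from the wiki discussion, the post-conversion steps being referred to are roughly the following sketch; device and mount point are placeholders, and the wiki page itself is the authoritative ordering.]

btrfs-convert /dev/sdXN
mount /dev/sdXN /mnt

# once satisfied with the result, drop the saved ext* image,
# then defragment and balance to finish the conversion
btrfs subvolume delete /mnt/ext2_saved
btrfs filesystem defragment -r /mnt
btrfs balance start /mnt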
[PATCH] btrfs: Add show_path function for btrfs_super_ops.
show_path() function in struct super_operations is used to output subtree mount info for mountinfo. Without the implement of show_path() function, user can not found where each subvolume is mounted if using 'subvolid=' mount option. (When mounted with 'subvol=' mount option, vfs is aware of subtree mount and can to the path resolve by vfs itself) With this patch, end users will be able to use findmnt(8) or other programs reading mountinfo to find which btrfs subvolume is mounted. Though we use fs_info->subvol_sem to protect show_path() from subvolume destroying/creating, if user renames/moves the parent non-subvolume dir of a subvolume, it is still possible that concurrency may happen and cause btrfs_search_slot() fails to find the desired key. In that case, we just return -EBUSY and info user to try again since extra locking like locking the whole subvolume tree is too expensive for such usage. Reported-by: Stefan G.Weichinger Signed-off-by: Qu Wenruo --- fs/btrfs/ctree.h | 2 + fs/btrfs/ioctl.c | 4 +- fs/btrfs/super.c | 112 +++ 3 files changed, 116 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index be91397..63fba05 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3881,6 +3881,8 @@ void btrfs_get_block_group_info(struct list_head *groups_list, struct btrfs_ioctl_space_info *space); void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock, struct btrfs_ioctl_balance_args *bargs); +int btrfs_search_path_in_tree(struct btrfs_fs_info *info, + u64 tree_id, u64 dirid, char *name); /* file.c */ diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 47aceb4..c2bd6b5 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2218,8 +2218,8 @@ static noinline int btrfs_ioctl_tree_search_v2(struct file *file, * Search INODE_REFs to identify path name of 'dirid' directory * in a 'tree_id' tree. and sets path name to 'name'. 
*/ -static noinline int btrfs_search_path_in_tree(struct btrfs_fs_info *info, - u64 tree_id, u64 dirid, char *name) +int btrfs_search_path_in_tree(struct btrfs_fs_info *info, + u64 tree_id, u64 dirid, char *name) { struct btrfs_root *root; struct btrfs_key key; diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 8e16bca..b5ece81 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1831,6 +1831,117 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root) return 0; } +static char *str_prepend(char *dest, char *src) +{ + memmove(dest + strlen(src), dest, strlen(dest) + 1); + memcpy(dest, src, strlen(src)); + return dest; +} + +static int alloc_mem_if_needed(char **dest, char *src, int *len) +{ + char *tmp; + + if (unlikely(strlen(*dest) + strlen(src) > *len)) { + *len *= 2; + tmp = krealloc(*dest, *len, GFP_NOFS); + if (!tmp) { + return -ENOMEM; + } + *dest = tmp; + } + return 0; +} + +static int btrfs_show_path(struct seq_file *m, struct dentry *mount_root) +{ + struct inode *inode = mount_root->d_inode; + struct btrfs_root *subv_root = BTRFS_I(inode)->root; + struct btrfs_fs_info *fs_info = subv_root->fs_info; + struct btrfs_root *tree_root = fs_info->tree_root; + struct btrfs_root_ref *ref; + struct btrfs_key key; + struct btrfs_key found_key; + struct btrfs_path *path = NULL; + char *name = NULL; + char *buf = NULL; + int ret = 0; + int len; + u64 dirid = 0; + u16 namelen; + + name = kmalloc(PAGE_SIZE, GFP_NOFS); + len = PAGE_SIZE; + buf = kmalloc(BTRFS_INO_LOOKUP_PATH_MAX, GFP_NOFS); + path = btrfs_alloc_path(); + if (!name || !buf || !path) { + ret = -ENOMEM; + goto out_free; + } + *name = '/'; + *(name + 1) = '\0'; + + key.objectid = subv_root->root_key.objectid; + key.type = BTRFS_ROOT_BACKREF_KEY; + key.offset = 0; + down_read(&fs_info->subvol_sem); + while (key.objectid != BTRFS_FS_TREE_OBJECTID) { + ret = btrfs_search_slot_for_read(tree_root, &key, path, 1, 1); + if (ret < 0) + goto out; + if (ret) { + ret = -ENOENT; + goto out; + } + btrfs_item_key_to_cpu(path->nodes[0], &found_key, + path->slots[0]); + if (found_key.objectid != key.objectid || + found_key.type != BTRFS_ROOT_BACKREF_KEY) { + ret = -ENOENT; + goto out; + } + /* append the subvol name first */ +
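[Editor's note: once a kernel with this show_path() implementation is running, the effect should be visible to userspace roughly as below; the subvolume id, device, and mount point are only examples.]

mount -o subvolid=257 /dev/sdb1 /mnt/data

# findmnt can now show which subvolume backs each btrfs mount
findmnt -t btrfs -o TARGET,SOURCE,FSTYPE,OPTIONS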
[PATCH] btrfs-progs: check if there is required kernel send stream version
When kernel does not have the send stream version 2 patches, the btrfs send with --stream-version 2 would fail with out giving the details what is wrong. This patch will help to identify correctly that required kernel patches are missing. Signed-off-by: Anand Jain --- cmds-send.c | 13 + send.h | 2 ++ utils.c | 17 + utils.h | 1 + 4 files changed, 33 insertions(+) diff --git a/cmds-send.c b/cmds-send.c index 9a73b32..0c20a6f 100644 --- a/cmds-send.c +++ b/cmds-send.c @@ -435,6 +435,7 @@ int cmd_send(int argc, char **argv) u64 parent_root_id = 0; int full_send = 1; int new_end_cmd_semantic = 0; + int k_sstream; memset(&send, 0, sizeof(send)); send.dump_fd = fileno(stdout); @@ -544,6 +545,18 @@ int cmd_send(int argc, char **argv) ret = 1; goto out; } + + /* check if btrfs kernel supports send stream ver 2 */ + if (g_stream_version > BTRFS_SEND_STREAM_VERSION_1) { + k_sstream = btrfs_read_sysfs(BTRFS_SEND_STREAM_VER_PATH); + if (k_sstream < g_stream_version) { + fprintf(stderr, + "ERROR: Need btrfs kernel send stream version %d or above, %d\n", + BTRFS_SEND_STREAM_VERSION_2, k_sstream); + ret = 1; + goto out; + } + } break; case 's': g_total_data_size = 1; diff --git a/send.h b/send.h index ea56965..d7a171b 100644 --- a/send.h +++ b/send.h @@ -24,6 +24,8 @@ extern "C" { #endif #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream" +#define BTRFS_SEND_STREAM_VER_PATH "/sys/fs/btrfs/send/stream_version" + #define BTRFS_SEND_STREAM_VERSION_1 1 #define BTRFS_SEND_STREAM_VERSION_2 2 /* Max supported stream version. */ diff --git a/utils.c b/utils.c index e144dfd..e3d4fa2 100644 --- a/utils.c +++ b/utils.c @@ -2681,3 +2681,20 @@ int fsid_to_mntpt(__u8 *fsid, char *mntpt, int *mnt_cnt) return ret; } + +int btrfs_read_sysfs(char path[PATH_MAX]) +{ + int fd; + char val; + + fd = open(path, O_RDONLY); + if (fd < 0) + return -errno; + + if (read(fd, &val, sizeof(char)) < sizeof(char)) { + close(fd); + return -EINVAL; + } + close(fd); + return atoi((const char *)&val); +} diff --git a/utils.h b/utils.h index ddf31cf..0c9b65f 100644 --- a/utils.h +++ b/utils.h @@ -153,5 +153,6 @@ static inline u64 btrfs_min_dev_size(u32 leafsize) return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE + btrfs_min_global_blk_rsv_size(leafsize)); } +int btrfs_read_sysfs(char path[PATH_MAX]); #endif -- 2.0.0.153.g79d -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
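[Editor's note: with the corresponding out-of-tree kernel patches applied, the check this patch adds corresponds to the following manual steps; the sysfs path comes from the patch itself, and the snapshot path is a placeholder.]

# kernel advertises its supported send stream version here (per these patches)
cat /sys/fs/btrfs/send/stream_version

# only then should a version-2 send be attempted
btrfs send --stream-version 2 /mnt/snapshots/snap1 > snap1.send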
[PATCH] xfstest/btrfs: check for matching kernel send stream ver 2
The test case btrfs/049 is relevant to send stream version 2, and needs kernel patches as well. So call _notrun if there isn't matching kernel support as shown below btrfs/047[not run] Missing btrfs kernel patch for send stream version 2, skipped this test Not run: btrfs/047 Signed-off-by: Anand Jain --- common/rc | 5 + 1 file changed, 5 insertions(+) diff --git a/common/rc b/common/rc index 4a6511f..1c914bb 100644 --- a/common/rc +++ b/common/rc @@ -2223,6 +2223,11 @@ _require_btrfs_send_stream_version() if [ $? -ne 0 ]; then _notrun "Missing btrfs-progs send --stream-version command line option, skipped this test" fi + + # test if btrfs kernel supports send stream version 2 + if [ ! -f /sys/fs/btrfs/send/stream_version ]; then + _notrun "Missing btrfs kernel patch for send stream version 2, skipped this test" + fi } _require_btrfs_mkfs_feature() -- 2.0.0.153.g79d -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
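[Editor's note: for completeness, a sketch of exercising the affected tests once xfstests is configured; setting up TEST_DEV/SCRATCH_DEV in local.config is assumed and not shown.]

cd xfstests-dev
sudo ./check btrfs/047 btrfs/049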