Re: Uncorrectable errors on RAID-1?

2015-01-04 Thread Phillip Susi

On 01/03/2015 12:31 AM, Chris Murphy wrote:
 It's not a made to order hard drive industry. Maybe one day you'll
 be able to 3D print your own with its own specs.

And Wookiees did not live on Endor.  What's your point?

 Sticking fingers in your ears doesn't change the fact there's a 
 measurable difference in support requirements.

Sure, just don't misrepresent one requirement for another.  Just
because I don't care about a warranty from the hardware manufacturer
does not mean I have no right to expect the kernel to perform
*reasonably* on that hardware.

 This is architecture astronaut territory.
 
 The system only has a terrible response for two reasons: 1. The
 user spec'd the wrong hardware for the use case; 2. The distro
 isn't automatically leveraging existing ways to mitigate that user
 mistake by changing either SCT ERC on the drives, or the SCSI
 command timer for each block device.

No, it has a terrible response because the kernel either waits an
unreasonable time or fails the drive and kicks it out of the array
instead of trying to repair it.  Blaming the user for not buying
better hardware is not an appropriate response to the kernel failing
so badly on commonly available hardware that doesn't behave in the
most ideal way.
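For what it's worth, the two mitigations named above are one-liners
today.  A rough sketch, assuming a drive that actually accepts SCT
ERC and the usual sysfs layout ( values illustrative ):

  # cap the drive's internal error recovery at 7 seconds
  smartctl -l scterc,70,70 /dev/sda
  # or, for drives that ignore SCT ERC, raise the kernel's command
  # timer above the drive's worst-case recovery time
  echo 180 > /sys/block/sda/device/timeout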

 Now, even though that solution *might* mean long recoveries on 
 occasion, it's still better than link reset behavior which is what
 we have today because it causes the underlying problem to be fixed
 by md/dm/Btrfs once the read error is reported. But no distro has 
 implemented this $500 man hour solution. Instead you're suggesting
 a $500,000 fix that will take hundreds of man hours and end user
 testing to find all the edge cases. It's like, seriously, WTF?

Seriously?  Treating a timeout the same way you treat an unrecoverable
media error is no herculean task.
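And the distro half of that solution really is tiny; a hedged sketch
of a udev rule ( the file name and match are illustrative, not taken
from any actual distro ):

  # /etc/udev/rules.d/60-scterc.rules
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"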

 Ok well I think that's hubris unless you're a hard drive engineer. 
 You're referring to how drives behaved over a decade ago, when bad 
 sectors were persistent rather than remapped, and we had to scan
 the drive at format time to build a map so the bad ones wouldn't be
 used by the filesystem.

Remapping has nothing to do with it: we are talking about *read*
errors, which do not trigger a remap.

 http://www.seagate.com/files/www-content/support-content/documentation/product-manuals/en-us/Enterprise/Savvio/Savvio%2015K.3/100629381e.pdf

 That's a high end SAS drive. Its default is to retry up to 20
 times, which takes ~1.4 seconds, per sector. But also note how it
 says

20 retries on a 15,000 rpm drive only takes 80 milliseconds, not 1.4
seconds.  15,000 rpm / 60 seconds per minute = 250 rotations/retries
per second.
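By the same arithmetic, the quoted ~1.4 seconds is room for roughly
1.4 x 250 = 350 retry passes, not 20.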

 Maybe you'd prefer seeing these big, cheap, green drives have 
 shorter ERC times, with a commensurate reality check with their 
 unrecoverable error rate, which right now is already two orders 
 magnitude higher than enterprise SAS drives. So what if this means 
 that rate is 3 or 4 orders magnitude higher?

20 retries vs. 200 retries does not reduce the URE rate by orders of
magnitude; more like 1% *maybe*.  200 vs 2000 makes no measurable
difference at all.




Re: Uncorrectable errors on RAID-1?

2014-12-30 Thread Phillip Susi

On 12/29/2014 4:53 PM, Chris Murphy wrote:
 Get drives supporting configurable or faster recoveries. There's
 no way around this.

Practically available right now?  Sure.  In theory, no.

 This is a broken record topic honestly. The drives under
 discussion aren't ever meant to be used in raid, they're desktop
 drives, they're designed with long recoveries because it's
 reasonable to try to

The intention to use the drives in a raid is entirely at the
discretion of the user, not the manufacturer.  The only reason we are
even having this conversation is because the manufacturer has added a
misfeature that makes them sub-optimal for use in a raid.

 recover the data even in the face of delays rather than not recover
 at all. Whether there are also some design flaws in here I can't
 say because I'm not a hardware designer or developer but they are
 very clearly targeted at certain use cases and not others, not
 least of which is their error recovery time but also their
 vibration tolerance when multiple drives are in close proximity to
 each other.

Drives have no business whatsoever retrying for so long; every version
of DOS or Windows ever released has been able to report an IO error
and give the *user* the option of retrying it in the hopes that it
will work that time, because drives used to be sane and not keep
retrying a positively ridiculous number of times.

 If you don't like long recoveries, don't buy drives with long 
 recoveries. Simple.

Better to fix the software to deal with it sensibly instead of
encouraging manufacturers to engage in hamstringing their lower priced
products to coax more money out of their customers.

 The device will absolutely provide a specific error so long as its 
 link isn't reset prematurely, which happens to be the linux
 default behavior when combined with drives that have long error
 recovery times. Hence the recommendation is to increase the linux
 command timer value. That is the solution right now. If you want a
 different behavior someone has to write the code to do it because
 it doesn't exist yet, and so far there seems to be zero interest in
 actually doing that work, just some interest in hand waiving that
 it ought to exist, maybe.

If this is your way of saying "patches welcome", then it probably
would have been better just to say that.




Re: I need to P. are we almost there yet?

2014-12-30 Thread Phillip Susi

On 12/29/2014 7:20 PM, ashf...@whisperpc.com wrote:
 Just some background data on traditional RAID, and the chances of
 survival with a 2-drive failure.
 
 In traditional RAID-10, the chances of surviving a 2-drive failure
 is 66% on a 4-drive array, and approaches 100% as the number of
 drives in the array increase.
 
 In traditional RAID-0+1 (used to be common in low-end fake-RAID
 cards), the chances of surviving a 2-drive failure is 33% on a
 4-drive array, and approaches 50% as the number of drives in the
 array increase.

In terms of data layout, there is really no difference between raid-10
( or raid1+0 ) and raid0+1, aside from the designation you assign to
each drive.  With a dumb implementation of 0+1, any single drive
failure offlines the entire stripe set, discarding the remaining good
disks in it; that yields the probability you describe, since the only
second failure that doesn't also kill the mirror is a drive in the
same stripe set as the first.  This, however, is a deficiency of the
implementation, not of the data layout: all of the data on the first
failed drive could still be recovered from a drive in the second
stripe set, so long as the second drive to fail is not the one holding
the duplicate of the first drive's data.

This is partly why I agree with linux mdadm that raid10 is *not*
simply raid1+0; the latter is just a naive, degenerate implementation
of the former.
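Worked out for a 4-drive array, assuming independent failures: with
mirrored pairs AB and CD there are C(4,2) = 6 equally likely 2-drive
failures, of which only {A,B} and {C,D} are fatal, so survival is
4/6 = 66%.  With the dumb 0+1 behavior, the first failure offlines a
whole stripe set, and only a second failure inside that same dead set
( 1 of the 3 remaining drives ) is survivable, hence 33%.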

 In traditional RAID-1E, the chances of surviving a 2-drive failure
 is 66% on a 4-drive array, and approaches 100% as the number of
 drives in the array increase.  This is the same as for RAID-10.
 RAID-1E allows an odd number of disks to be actively used in the
 array.

What some vendors have called RAID-1E is simply raid10 in what mdadm
calls the default "near" layout.  I prefer the higher-performance
"offset" layout myself.

 I'm wondering which of the above the BTRFS implementation most
 closely resembles.

Unfortunately, btrfs just uses the naive raid1+0, so no 2- or 3-disk
raid10 arrays, and no higher-performing offset layout.




Re: I need to P. are we almost there yet?

2014-12-30 Thread Phillip Susi

On 12/30/2014 06:17 PM, ashf...@whisperpc.com wrote:
 I believe that someone who understands the code in depth (and that
 may also be one of the people above) determine exactly how BTRFS
 implements RAID-10.

I am such a person.  I had a similar question a year or two ago (
specifically about raid10  ) so I both experimented and read the code
myself to find out.  I was disappointed to find that it won't do
raid10 on 3 disks since the chunk metadata describes raid10 as a
stripe layered on top of a mirror.

Jose's point was also a good one though: one chunk may decide to
mirror disks A and B, so it could survive a failure of A and C, but a
different chunk could choose to mirror disks A and C, and that chunk
would be lost if A and C fail.  It would probably be nice if the
chunk allocator tried to be more deterministic about that.
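Anyone who wants to see this on a live filesystem can dump the chunk
tree; a hedged sketch ( recent btrfs-progs call this inspect-internal,
older releases shipped the same dump as btrfs-debug-tree ):

  # list each chunk and the devids of the stripes backing it;
  # the devid pairs can differ from one chunk to the next
  btrfs inspect-internal dump-tree -t chunk /dev/sda | grep -E 'CHUNK_ITEM|devid'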




Re: Uncorrectable errors on RAID-1?

2014-12-30 Thread Phillip Susi

On 12/30/2014 06:58 PM, Chris Murphy wrote:
 Practically available right now?  Sure.  In theory, no.
 
 I have no idea what this means. Such drives exist, you can buy them
 or not buy them.

I was referring to the "no way around this" part.  Currently you are
correct, but in theory the way around it is exactly the subject of
this thread.

 Clearly you have never owned a business, nor have you been involved
 in volume manufacturing or you wouldn't be so keen to demand one
 market subsidize another. 24x7 usage is a non-trivial quantity of
 additional wear and tear on the drive compared to 8 hour/day, 40
 hour/week duty cycle. But you seem to think that the manufacturer
 has no right to produce a cheaper one for the seldom used hardware,
 or a more expensive one for the constantly used hardware.

Just because I want a raid doesn't mean I need it to operate reliably
24x7.  For that matter, it has long been established that power
cycling drives puts more wear and tear on them and as a general rule,
leaving them on 24x7 results in them lasting longer.

 And of course you completely ignored, and deleted, my point about
 the difference in warranties.

Because I don't care?  It's nice and all that they warranty the more
expensive drive more, and it may possibly even mean that they are
actually more reliable ( but not likely ), but that doesn't mean that
the system should have an unnecessarily terrible response to the
behavior of the cheaper drives.  Is it worth recommending the more
expensive drives?  Sure... but the system should also handle the
cheaper drives with grace.

 Does the SATA specification require configurable SCT ERC? Does it 
 require even supporting SCT ERC? I think your argument is flawed
 by mis-distributing the economic burden while simultaneously
 denying one even exists or that these companies should just eat the
 cost differential if it does. In any case the argument is asinine.

There didn't use to be any such thing; drives simply did not *ever*
go into absurdly long internal retries, so there was no need.  The
fact that they do these days I consider a misfeature, and one that
*can* be worked around in software, which is the point here.

 When the encoded data signal weakens, they effectively becomes
 fuzzy bits. Each read produces different results. Obviously this is
 a very rare condition or there'd be widespread panic. However, it's
 common and expected enough that the drive manufacturers are all, to
 very little varying degree, dealing with this problem in a similar
 way, which is multiple reads.

Sure, but the noise introduced by the read ( as opposed to the noise
in the actual signal on the platter ) isn't that large, so retrying
10,000 times isn't going to give any better results than retrying,
say, 100 times; and if the user really desires that many retries, they
have always been able to do so at the software level rather than
depending on the drive to try that much.  There is no reason for the
drives to have increased their internal retries that much, and then
deliberately withheld the essentially zero-cost ability to limit those
internal retries, other than to push customers toward the more
expensive models.

 Now you could say they're all in collusion with each other to
 screw users over, rather than having legitimate reasons for all of
 these retried. Unless you're a hard drive engineer, I'm unlikely to
 find such an argument compelling. Besides, it would also be a
 charge of fraud.

Calling it fraud might be a bit of a stretch, but yes, there is no
legitimate reason for *that* many retries: people have been retrying
failed reads in software for decades, and the returns from increasing
the retry count diminish rapidly.

 In the meantime, there already is a working software alternative: 
 (re)write over all sectors periodically. Perhaps every 6-12 months
 is sufficient to mitigate such signal weakening on marginal sectors
 that aren't persistently failing on writes. This can be done with
 a periodic reshape if it's md raid. It can be done with balance on 
 Btrfs. It can be done with resilvering on ZFS.

Is there any actual evidence that this is effective?  Or that the
recording degrades as a function of time?  I doubt it, since I have
data on drives that were last written 10 years ago that is still
readable.  Even if so, this is really a non sequitur: if the signal
has degraded enough to be hard to read, in a raid we can simply
recover using the other drives.  The issue here is whether we should
be doing such recovery sooner rather than waiting for the silly drive
to retry 100,000 times before giving up.
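( For reference, the rewrite-style scrubs being suggested map to
commands like these; whether they actually help with signal fade is
exactly what I'm questioning:

  echo repair > /sys/block/md0/md/sync_action  # md: scrub, rewrite bad/mismatched sectors
  btrfs balance start /mnt                     # btrfs: rewrite every chunk
)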



Re: Uncorrectable errors on RAID-1?

2014-12-27 Thread Phillip Susi

On 12/23/2014 05:09 PM, Chris Murphy wrote:
 The timer in /sys is a kernel command timer, it's not a device
 timer even though it's pointed at a block device. You need to
 change that from 30 to something higher to get the behavior you
 want. It doesn't really make sense to say, timeout in 30 seconds,
 but instead of reporting a timeout, report it as a read error.
 They're completely different things.

The idea is not to give the drive a ridiculous amount of time to
recover without timing out, but for the timeout to be handled properly.

 There are all sorts of errors listed in libata so for all of them
 to get dumped into a read error doesn't make sense. A lot of those
 errors don't report back a sector, and the key part of the read
 error is what sector(s) have the problem so that they can be fixed.
 Without that information, the ability to fix it is lost. And it's
 the drive that needs to report this.

It is not lost.  The information is simply fuzzed from an exact
individual sector to the range of sectors in the timed-out request.  In
an ideal world the drive would give up in a reasonable time and report
the failure, but if it doesn't, then we should deal with that in a
better way than hanging all IO for an unacceptably long time.

 Oven doesn't work, so lets spray gasoline on it and light it and
 the kitchen on fire so that we can cook this damn pizza! That's
 what I just read. Sorry. It doesn't seem like a good idea to me to
 map all errors as read errors.

How do you conclude that?  In the face of a timeout your choices are
between kicking the whole drive out of the array immediately, or
attempting to repair it by recovering the affected sector(s) and
rewriting them.  Unless that recovery attempt could cause more harm
than degrading the array, then where is the throwing gasoline on it
part?  This is simply a case of the device not providing a specific
error that says whether it can be recovered or not, so let's attempt
the recovery and see if it works instead of assuming that it won't and
possibly causing data loss that could be avoided.

 Any decent server SATA drive should support SCT ERC. The
 inexpensive WDC Red drives for NAS's all have it and by default are
 a reasonable 70 deciseconds last time I checked.

And yet it isn't supported on the cheaper but otherwise identical
Greens, or the higher-performing Blues.  We should not be helping
vendors charge a premium for zero-cost firmware features by treating
them as required for raid use when they really aren't ( even if they
are nice to have ).
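Checking is at least cheap; on a drive that supports it,

  smartctl -l scterc /dev/sda

prints the current read/write ERC timers, and on the Greens it simply
reports that SCT ERC is not supported.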




Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Phillip Susi

On 12/18/2014 9:59 AM, Daniele Testa wrote:
 As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On 
 that partition, I have one single sparse file, taking 302GB of
 space (max 315GB). The snapshots directory is completely empty.

So you don't have any snapshots or other subvolumes?

 However, for some weird reason, btrfs seems to think it takes
 404GB. The big file is a disk that I use in a virtual server and
 when I write stuff inside that virtual server, the disk-usage of
 the btrfs partition on the host keeps increasing even if the
 sparse-file is constant at 302GB. I even have 100GB of free
 disk-space inside that virtual disk-file. Writing 1GB inside the
 virtual disk-file seems to increase the usage about 4-5GB on the
 outside.

Did you flag the file as nodatacow?

 Does anyone have a clue on what is going on? How can the
 difference and behaviour be like this when I just have one single
 file? Is it also normal to have 672MB of metadata for a single
 file?

You probably have the data checksums enabled, and that isn't
unreasonable for checksums on 302 GB of data.
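Worked out, assuming the default 4 KiB block size and the 8 bytes of
csum btrfs keeps per block: 302 GiB of data is 79,167,488 blocks, and
79,167,488 x 8 bytes is ~604 MiB, which accounts for most of the 672MB
you are seeing.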




Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Phillip Susi

On 12/19/2014 2:59 PM, Daniele Testa wrote:
 No, I don't have any snapshots or subvolumes. Only that single
 file.
 
 The file has both checksums and datacow on it. I will do chattr
 +C on the parent dir and re-create the file to make sure all files
 are marked as nodatacow.
 
 Should I also turn off checksums with the mount-flags if this 
 filesystem only contain big VM-files? Or is it not needed if I put
 +C on the parent dir?

If you don't want the overhead of those checksums, then yes.  Also, I
would question why you are using btrfs to hold only big VM files in
the first place.  You would be better off using lvm thinp volumes
instead of files, though personally I prefer to just use regular lvm
volumes and manually allocate enough space.  That avoids the
fragmentation you get from thin provisioning ( or qcow2 ) at the cost
of a bit of overallocated space and the need to do some manual
resizing to add more if and when it is needed.
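A hedged sketch of the nodatacow route ( file name illustrative; note
that +C only takes effect for files created after the flag is set, so
existing images must be copied anew ):

  chattr +C /opt/drives/ssd           # new files in the dir inherit NOCOW
  lsattr /opt/drives/ssd/disk.img     # verify the C attribute on the copy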



Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Phillip Susi

On 12/19/2014 4:15 PM, Josef Bacik wrote:
 Please God don't turn off of checksums.  Checksums are tracked in 
 metadata anyway, they won't show up in the data accounting.  Our
 csums are 8 bytes per block, so basic math says you are going to
 max out at 604 megabytes for that big of a file.

Yes, and it is exactly that metadata space he is complaining about.
So if you don't want to use up all of that space ( and have no use for
the checksums ), then you turn them off.

 Please people try to only take advice from people who know what
 they are talking about.  So unless it's from somebody who has
 commits in btrfs/btrfs-progs take their feedback with a grain of
 salt.  Thanks,

Well that is rather arrogant and rude.  For that matter, I *do* have
commits in btrfs-progs.




Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-10 Thread Phillip Susi

On 12/9/2014 10:10 PM, Anand Jain wrote:
 In the test case provided earlier who is triggering the scan ? 
 grub-probe ?

The scan is initiated by udev.  grub-probe only comes into it because
it is looking at /proc/mounts to find out what device is mounted, and
/proc/mounts is lying.

 But we had to revert, Since btrfs bug become a feature for the
 system boot process and fixing that breaks mount at boot with
 subvol.

How is this?  Also, are we talking about updating the cached list of
devices that *can* be mounted, or what device already *is* mounted?  I
can see doing the former, but the latter should never happen.

 if the device is already mounted, just the device path is updated 
 but still the original device will be still in use (bug).

Yep, that is the bug that started all of this.




Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-09 Thread Phillip Susi

On 12/8/2014 5:25 PM, Konstantin wrote:
 
 Phillip Susi schrieb am 08.12.2014 um 15:59:
 The bios does not know or care about partitions.  All you need is
 a
 That's only true for older BIOSs. With current EFI boards they not
 only care but some also mess around with GPT partition tables.

EFI is a whole other beast that we aren't talking about.

 partition table in the MBR and you can install grub there and
 have it boot the system from a mdadm 1.1 or 1.2 format array
 housed in a partition on the rest of the disk.  The only time you
 really *have* to
 I was thinking of this solution as well but as I'm not aware of
 any partitioning tool caring about mdadm metadata so I rejected it.
 It requires a non-standard layout leaving reserved empty spaces for
 mdadm metadata. It's possible but it isn't documented so far I know
 and before losing hours of trying I chose the obvious one.

What on earth are you talking about?  A partitioning tool that cares
about mdadm?  A non-standard layout?  I am talking about the
bog-standard layout where you create a partition, then use that
partition to build an mdadm array.  mdadm takes care of its own
metadata.  There isn't anything unusual, non-obvious, or undocumented
here.
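To spell it out, a hedged sketch of that bog-standard layout ( device
names illustrative ):

  parted -s /dev/sda mklabel msdos
  parted -s /dev/sda mkpart primary 1MiB 100%
  mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.2 /dev/sda1 /dev/sdb1
  grub-install /dev/sda   # grub lives in the MBR gap; no separate boot disk needed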

 use 0.9 or 1.0 ( and you really should be using 1.0 instead since
 it handles larger arrays and can't be confused vis. whole disk
 vs. partition components ) is if you are running a raid1 on the
 raw disk, with no partition table and then partition inside the
 array instead, and really, you just shouldn't be doing that.
 That's exactly what I want to do - running RAID1 on the whole disk
 as most hardware based RAID systems do. Before that I was running
 RAID on disk partitions for some years but this was quite a pain in
 comparison. Hot(un)plugging a drive brings you a lot of issues with
 failing mdadm commands as they don't like concurrent execution when
 the same physical device is affected. And rebuild of RAID
 partitions is done sequentially with no deterministic order. We
 could talk for hours about that but if interested maybe better in
 private as it is not BTRFS related.

So don't create more than one raid partition on the disk.

 dmraid solves the problem by removing the partitions from the 
 underlying physical device ( /dev/sda ), and only exposing them
 on the array ( /dev/mapper/whatever ).  LVM only has the problem
 when you take a snapshot.  User space tools face the same issue
 and they resolve it by ignoring or deprioritizing the snapshot.
 I don't agree. dmraid and mdraid both remove the partitions. This
 is not a solution BTRFS will still crash the PC using
 /dev/mapper/whatever or whatever device appears in the system
 providing the BTRFS volume.

You just said btrfs will crash by accessing the *correct* volume after
the *incorrect* one has been removed.  You aren't making any sense.
The problem only arises when the same partition is visible on *both*
the raw disk and the md device.

 Speaking of BTRFS tools, I am still somehow confused that the
 problem confusing or mixing devices happens at all. I don't know
 the metadata of a BTRFS RAID setup but I assume there must be
 something like a drive index in there, as the order of RAID5 drives
 does matter. So having a second device with identical metadata
 should be considered invalid for auto-adding anyway.

Again, the problem is when you first boot up and/or mount the volume.
 Which of the duplicate devices shows up first is indeterminate, so
just saying "ignore the second one" doesn't help.  Even saying "error
out if there are two" doesn't help, since that leaves open a race
condition where the second volume has not appeared yet at the time you
do the check.




Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-08 Thread Phillip Susi

On 12/7/2014 7:32 PM, Konstantin wrote:
 I'm guessing you are using metadata format 0.9 or 1.0, which put
 the metadata at the end of the drive and the filesystem still
 starts in sector zero.  1.2 is now the default and would not have
 this problem as its metadata is at the start of the disk ( well,
 4k from the start ) and the fs starts further down.
 I know this and I'm using 0.9 on purpose. I need to boot from
 these disks so I can't use 1.2 format as the BIOS wouldn't
 recognize the partitions. Having an additional non-RAID disk for
 booting introduces a single point of failure which contrary to the
 idea of RAID0.

The bios does not know or care about partitions.  All you need is a
partition table in the MBR and you can install grub there and have it
boot the system from a mdadm 1.1 or 1.2 format array housed in a
partition on the rest of the disk.  The only time you really *have* to
use 0.9 or 1.0 ( and you really should be using 1.0 instead, since it
handles larger arrays and can't be confused vis-a-vis whole-disk vs.
partition components ) is if you are running a raid1 on the raw disk
with no partition table, partitioning inside the array instead; and
really, you just shouldn't be doing that.

 Anyway, to avoid a futile discussion, mdraid and its format is not
 the problem, it is just an example of the problem. Using dm-raid
 would do the same trouble, LVM apparently, too. I could think of a
 bunch of other cases including the use of hardware based RAID
 controllers. OK, it's not the majority's problem, but that's not
 the argument to keep a bug/flaw capable of crashing your system.

dmraid solves the problem by removing the partitions from the
underlying physical device ( /dev/sda ), and only exposing them on the
array ( /dev/mapper/whatever ).  LVM only has the problem when you
take a snapshot.  User space tools face the same issue and they
resolve it by ignoring or deprioritizing the snapshot.

 As it is a nice feature that the kernel apparently scans for drives
 and automatically identifies BTRFS ones, it seems to me that this
 feature is useless. When in a live system a BTRFS RAID disk fails,
 it is not sufficient to hot-replace it, the kernel will not
 automatically rebalance. Commands are still needed for the task as
 are with mdraid. So the only point I can see at the moment where
 this auto-detect feature makes sense is when mounting the device
 for the first time. If I remember the documentation correctly, you
 mount one of the RAID devices and the others are automagically
 attached as well. But outside of the mount process, what is this
 auto-detect used for?
 
 So here a couple of rather simple solutions which, as far as I can
 see, could solve the problem:
 
 1. Limit the auto-detect to the mount process and don't do it when 
 devices are appearing.
 
 2. When a BTRFS device is detected and its metadata is identical to
 one already mounted, just ignore it.

That doesn't really solve the problem since you can still pick the
wrong one to mount in the first place.



Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device

2014-12-08 Thread Phillip Susi

On 12/4/2014 1:39 PM, Goffredo Baroncelli wrote:
 LVM snapshots are a problem for the btrfs devices management. BTRFS
 assumes that each device have an unique 'device UUID'. A LVM
 snapshot breaks this assumption.
 
 This causes a lot of problems if some btrfs device are
 snapshotted: - the set of devices for a btrfs multi-volume
 filesystem may be mixed (i.e. some NON snapshotted device with some
 snapshotted devices) - /proc/mount may returns a wrong device.
 
 In the mailing list some posts reported these incidents.
 
 This patch allows btrfs to skip LVM snapshot during the device scan
  phase.
 
 But if you need to consider a LVM snapshot you can set the 
 environment variable BTRFS_SKIP_LVM_SNAPSHOT to no. In this case 
 the old behavior is applied.
 
 To check if a device is a LVM snapshot, it is checked the 'udev' 
 device property 'DM_UDEV_LOW_PRIORITY_FLAG' . If it is set to 1, 
 the device has to be skipped.
 
 As consequence, btrfs now depends also by the libudev.

Rather than modify btrfs device scan to link to libudev and ignore the
caller when commanded to scan a snapshot, wouldn't it be
simpler/better to just fix the udev rule to not *call* btrfs device
scan on the snapshot?
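Something along these lines in the existing rule file, sketched and
untested, keyed on the same flag the patch uses:

  # /lib/udev/rules.d/70-btrfs.rules
  ENV{DM_UDEV_LOW_PRIORITY_FLAG}=="1", GOTO="btrfs_end"
  SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="btrfs", RUN+="/sbin/btrfs device scan $env{DEVNAME}"
  LABEL="btrfs_end"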




Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device

2014-12-08 Thread Phillip Susi

On 12/8/2014 12:36 PM, Goffredo Baroncelli wrote:
 I like this approach, but as I wrote before, it seems that 
 initramfs executes a btrfs dev scan (see my previoue email 
 'Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their
 snapshots' date 2014/12/03 9:34):
 
 $ grep -r "btrfs dev.*scan" /usr/share/initramfs-tools/
 /usr/share/initramfs-tools/scripts/local-premount/btrfs:
 /sbin/btrfs device scan 2> /dev/null
 
 (this is from a debian). However it has to be pointed out that 
 fedora doesn't seems to do the same...

Need to fix that initramfs script then.  On the other hand, if one
*does* run a scan with no arguments, then it probably is a good idea
to ignore snapshots.




Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-03 Thread Phillip Susi

On 12/03/2014 03:24 AM, Goffredo Baroncelli wrote:
 I am thinking about that. Today the device discovery happens: a)
 when a device appears, two udev rules run btrfs dev scan
 device
 
 /lib/udev/rules.d/70-btrfs.rules 
 /lib/udev/rules.d/80-btrfs-lvm.rules
 
 b) during the boot it is ran a btrfs device scan, which scan all
  the device (this happens in debian for other distros may be
 different)
 
 c) after a btrfs.mkfs, which starts a device scan on each devices
 of the new filesystem
 
 d) by the user

Are you sure the kernel only gains awareness of btrfs volumes when
user space runs btrfs device scan?  If that is so, then that means you
cannot boot from a multi-device btrfs root without using an initramfs.
 I thought the kernel auto-scanned all devices if you tried to mount a
multi-device volume, but if this is so, then yes, the udev rules could
be fixed to not call btrfs device scan on an LVM snapshot.
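As an aside, the scan can be bypassed entirely by naming the members
at mount time, which is what an initramfs-less multi-device root would
need anyway; a hedged sketch:

  mount -t btrfs -o device=/dev/sdb1,device=/dev/sdc1 /dev/sdb1 /mnt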




Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-02 Thread Phillip Susi

On 12/2/2014 7:23 AM, Austin S Hemmelgarn wrote:
 Stupid thought, why don't we just add blacklisting based on device
 path like LVM has for pvscan?

That isn't logic that belongs in the kernel, so that is going down the
path of yanking the device auto-probing out of btrfs and instead
writing a mount.btrfs helper that can use policies like blacklisting
to auto-locate all of the correct devices and pass them all to the
kernel at mount time.




Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-02 Thread Phillip Susi

On 12/2/2014 6:54 AM, Anand Jain wrote:
 we have some fundamentally wrong stuff. My original patch tried to
 fix it. But later discovered that some external entities like 
 systmed and boot process is using that bug as a feature and we had 
 to revert the patch.

If systemd is depending on the kernel lying about what device it has
mounted, then something is *extremely* broken there, and that should
be fixed instead of breaking the kernel.




Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-02 Thread Phillip Susi

On 12/1/2014 4:45 PM, Konstantin wrote:
 The bug appears also when using mdadm RAID1 - when one of the
 drives is detached from the array then the OS discovers it and
 after a while (not directly, it takes several minutes) it appears
 under /proc/mounts: instead of /dev/md0p1 I see there /dev/sdb1.
 And usually after some hour or so (depending on system workload)
 the PC completely freezes. So discussion about the uniqueness of
 UUIDs or not, a crashing kernel is telling me that there is a
 serious bug.

I'm guessing you are using metadata format 0.9 or 1.0, which puts the
metadata at the end of the drive while the filesystem still starts in
sector zero.  1.2 is now the default and would not have this problem,
as its metadata is at the start of the disk ( well, 4k from the start
) and the fs starts further down.
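An easy way to confirm which format is in play ( fields from mdadm's
usual --examine output ):

  mdadm --examine /dev/sdb1 | grep -E 'Version|Super Offset'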





Re: scrub implies failing drive - smartctl blissfully unaware

2014-12-01 Thread Phillip Susi

On 11/25/2014 6:13 PM, Chris Murphy wrote:
 The drive will only issue a read error when its ECC absolutely
 cannot recover the data, hard fail.
 
 A few years ago companies including Western Digital started
 shipping large cheap drives, think of the green drives. These had
 very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT
 ERC. Later they completely took out the ability to configure this
 error recovery timing so you only get the upward of 2 minutes to
 actually get a read error reported by the drive. Presumably if the
 ECC determines it's a hard fail and no point in reading the same
 sector 14000 times, it would issue a read error much sooner. But
 again, the linux-raid list if full of cases where this doesn't
 happen, and merely by changing the linux SCSI command timer from 30
 to 121 seconds, now the drive reports an explicit read error with
 LBA information included, and now md can correct the problem.

I have one of those and took it out of service when it started reporting
read errors ( not timeouts ).  I tried several times to write over the
bad sectors to force reallocation and it worked again for a while...
then the bad sectors kept coming back.  Oddly, the SMART values never
indicated anything had been reallocated.

 That's my whole point. When the link is reset, no read error is 
 submitted by the drive, the md driver has no idea what the drive's 
 problem was, no idea that it's a read problem, no idea what LBA is 
 affected, and thus no way of writing over the affected bad sector.
 If the SCSI command timer is raised well above 30 seconds, this
 problem is resolved. Also replacing the drive with one that
 definitively errors out (or can be configured with smartctl -l
 scterc) before 30 seconds is another option.

It doesn't know why or exactly where, but it does know *something* went
wrong.

 It doesn't really matter, clearly its time out for drive commands
 is much higher than the linux default of 30 seconds.

Only if you are running Linux and can see the timeouts.  You can't
assume that's what is going on under Windows just because the desktop
stutters.

 OK that doesn't actually happen and it would be completely f'n
 wrong behavior if it were happening. All the kernel knows is the
 command timer has expired, it doesn't know why the drive isn't
 responding. It doesn't know there are uncorrectable sector errors
 causing the problem. To just assume link resets are the same thing
 as bad sectors and to just wholesale start writing possibly a
 metric shit ton of data when you don't know what the problem is
 would be asinine. It might even be sabotage. Jesus...

In normal single-disk operation, sure: the kernel resets the drive and
retries the request.  But like I said before, I could have sworn there
was an early-failure flag that md uses to tell the lower layers NOT to
attempt that kind of normal recovery, and instead just to return the
failure right away so md can go grab the data from the drive that
isn't wigging out.  That prevents the system from stalling on paging
IO while the drive plays around with its deep recovery, and copying
back 512k to the drive with the one bad sector isn't really that big
of a deal.

 Then there is one option which is to increase the value of the
 SCSI command timer. And that applies to all raid: md, lvm, btrfs,
 and hardware.

And then you get stupid hanging when you could just get the data from
the other drive immediately.


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-25 Thread Phillip Susi

On 11/19/2014 7:05 PM, Chris Murphy wrote:
 I'm not a hard drive engineer, so I can't argue either point. But 
 consumer drives clearly do behave this way. On Linux, the kernel's 
 default 30 second command timer eventually results in what look
 like link errors rather than drive read errors. And instead of the
 problems being fixed with the normal md and btrfs recovery
 mechanisms, the errors simply get worse and eventually there's data
 loss. Exhibits A, B, C, D - the linux-raid list is full to the brim
 of such reports and their solution.

I have seen plenty of error logs of people with drives that do
properly give up and return an error instead of timing out so I get
the feeling that most drives are properly behaved.  Is there a
particular make/model of drive that is known to exhibit this silly
behavior?

 IIRC, this is true when the drive returns failure as well.  The
 whole bio is marked as failed, and the page cache layer then
 begins retrying with progressively smaller requests to see if it
 can get *some* data out.
 
 Well that's very coarse. It's not at a sector level, so as long as
 the drive continues to try to read from a particular LBA, but fails
 to either succeed reading or give up and report a read error,
 within 30 seconds, then you just get a bunch of wonky system
 behavior.

I don't understand this response at all.  The drive isn't going to
keep trying to read the same bad LBA; after the kernel times out, it
resets the drive and tries reading different, smaller parts to see
which it can read and which it can't.

 Conversely what I've observed on Windows in such a case, is it 
 tolerates these deep recoveries on consumer drives. So they just
 get really slow but the drive does seem to eventually recover
 (until it doesn't). But yeah 2 minutes is a long time. So then the
 user gets annoyed and reinstalls their system. Since that means
 writing to the affected drive, the firmware logic causes bad
 sectors to be dereferenced when the write error is persistent.
 Problem solved, faster system.

That seems like rather unsubstantiated guesswork; i.e., the
2-minute-plus delays are likely not on an individual request, but from
several requests that each go into deep recovery, possibly because
Windows is retrying the same sector or a few consecutive sectors are
bad.

 Because now you have a member drive that's inconsistent. At least
 in the md raid case, a certain number of read failures causes the
 drive to be ejected from the array. Anytime there's a write
 failure, it's ejected from the array too. What you want is for the
 drive to give up sooner with an explicit read error, so md can help
 fix the problem by writing good data to the effected LBA. That
 doesn't happen when there are a bunch of link resets happening.

What?  It is no different than when it does return an error, with the
exception that the error is incorrectly applied to the entire request
instead of just the affected sector.

 Again, if your drive SCT ERC is configurable, and set to something 
 sane like 70 deciseconds, that read failure happens at MOST 7
 seconds after the read attempt. And md is notified of *exactly*
 what sectors are affected, it immediately goes to mirror data, or
 rebuilds it from parity, and then writes the correct data to the
 previously reported bad sectors. And that will fix the problem.

Yes... I'm talking about when the drive doesn't support that.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-25 Thread Phillip Susi

On 11/19/2014 6:59 PM, Duncan wrote:
 It's not physical spinup, but electronic device-ready.  It happens
 on SSDs too and they don't have anything to spinup.

If you have an SSD that isn't handling IO within 5 seconds or so of
power on, it is badly broken.

 But, for instance on my old seagate 300-gigs that I used to have in
 4-way mdraid, when I tried to resume from hibernate the drives
 would be spunup and talking to the kernel, but for some seconds to
 a couple minutes or so after spinup, they'd sometimes return
 something like (example) Seagrte3x0 instead of Seagate300.  Of
 course that wasn't the exact string, I think it was the model
 number or perhaps the serial number or something, but looking at
 dmsg I could see the ATA layer up for each of the four devices, the
 connection establish and seem to be returning good data, then the
 mdraid layer would try to assemble and would kick out a drive or
 two due to the device string mismatch compared to what was there 
 before the hibernate.  With the string mismatch, from its
 perspective the device had disappeared and been replaced with
 something else.

Again, these drives were badly broken then.  Even if it needs extra
time to come up for some reason, it shouldn't be reporting that it is
ready and returning incorrect information.

 And now I seen similar behavior resuming from suspend (the old
 hardware wouldn't resume from suspend to ram, only hibernate, the
 new hardware resumes from suspend to ram just fine, but I had
 trouble getting it to resume from hibernate back when I first setup
 and tried it; I've not tried hibernate since and didn't even setup
 swap to hibernate to when I got the SSDs so I've not tried it for a
 couple years) on SSDs with btrfs raid.  Btrfs isn't as informative
 as was mdraid on why it kicks a device, but dmesg says both devices
 are up, while btrfs is suddenly spitting errors on one device.  A
 reboot later and both devices are back in the btrfs and I can do a
 scrub to resync, which generally finds and fixes errors on the
 btrfs that were writable (/home and /var/log), but of course not on
 the btrfs mounted as root, since it's read-only by default.

Several months back I was working on some patches to avoid blocking a
resume until after all disks had spun up ( someone else ended up
getting a different version merged to the mainline kernel ).  I looked
quite hard at the timings of things during suspend and found that my
SSD was ready and handling IO darn near instantly, while the HD ( a
5900 rpm WD Green at the time ) took something like 7 seconds before
it was completing IO.  These days I'm running a raid10 on three 7200
rpm Blues and it comes right up from suspend with no problems, just as
it should.

 The paper specifically mentioned that it wasn't necessarily the
 more expensive devices that were the best, either, but the ones
 that faired best did tend to have longer device-ready times.  The
 conclusion was that a lot of devices are cutting corners on
 device-ready, gambling that in normal use they'll work fine,
 leading to an acceptable return rate, and evidently, the gamble
 pays off most of the time.

I believe I read the same study and don't recall any such conclusion.
 Instead, the conclusion was that the badly behaving drives weren't
ordering their internal writes correctly or flushing their metadata
from RAM to flash before completing the write request.  The problem
was on the power *loss* side, not the power-application side.

 The spinning rust in that study faired far better, with I think
 none of the devices scrambling their own firmware, and while there
 was some damage to storage, it was generally far better confined.

That is because they don't have a flash translation layer to get
mucked up and prevent them from knowing where the blocks are on disk.
 The worst thing you get out of an HDD losing power during a write is
that the sector it was writing is corrupted and you have to re-write
it.

 My experience says otherwise.  Else explain why those problems
 occur in the first two minutes, but don't occur if I hold it at the
 grub prompt to stabilize for two minutes, and never during normal
 post-stabilization operation.  Of course perhaps there's another
 explanation for that, and I'm conflating the two things.  But so
 far, experience matches the theory.

I don't know what was broken about these drives, only that it wasn't
capacitors since those charge in milliseconds, not seconds.  Further,
all systems using microprocessors ( like the one in the drive that
controls it ) have reset circuitry that prevents them from running
until after any caps have charged enough to get the power rail up to
the required voltage.



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-22 Thread Phillip Susi

On 11/21/2014 04:12 PM, Robert White wrote:
 Here's a bug from 2005 of someone having a problem with the ACPI
 IDE support...

That is not ACPI emulation.  ACPI is not used to access the disk,
but rather it has hooks that give it a chance to diddle with the disk
to do things like configure it to lie about its maximum size, or issue
a security unlock during suspend/resume.

 People debating the merits of the ACPI IDE drivers in 2005.

No... that's not a debate at all; it is one guy asking if he should
use IDE or ACPI mode... someone who again meant AHCI and typed the
wrong acronym.

 Even when you get me for referencing windows, you're still 
 wrong...
 
 How many times will you try to get out of being hideously horribly
 wrong about ACPI supporting disk/storage IO? It is neither recent
 nor rare.
 
 How much egg does your face really need before you just see that
 your fantasy that it's new and uncommon is a delusional mistake?

Project much?

It seems I've proven just about everything I originally said you got
wrong now so hopefully we can be done.



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Phillip Susi

On 11/20/2014 5:45 PM, Robert White wrote:
 Nice attempt at saving face, but wrong as _always_.
 
 The CONFIG_PATA_ACPI option has been in the kernel since 2008 and
 lots of people have used it.
 
 If you search for ACPI ide you'll find people complaining in
 2008-2010 about windows error messages indicating the device is
 present in their system but no OS driver is available.

Nope... not finding it.  The closest thing was one or two people who
said ACPI when they meant AHCI ( and were quickly corrected ).  This
is probably what you were thinking of since windows xp did not ship
with an ahci driver so it was quite common for winxp users to have
this problem when in _AHCI_ mode.

 That you have yet to see a single system that implements it is
 about the worst piece of internet research I've ever seen. Do you
 not _get_ that your opinion about what exists and how it works is
 not authoritative?

Show me one and I'll give you a cookie.  I have disassembled a number
of acpi tables and have yet to see one that has it.  What's more,
motherboard vendors tend to implement only the absolute minimum they
have to.  Since nobody actually needs this feature, they aren't going
to bother with it.  Do you not get that your hand-waving arguments of
"you can google for it" are not authoritative?

 You can also find articles about both windows and linux systems
 actively using ACPI fan control going back to 2009

Maybe you should have actually read those articles.  Linux supports
acpi fan control, unfortunately, almost no motherboards actually
implement it.  Almost everyone who wants fan control working in linux
has to install lm-sensors and load a driver that directly accesses one
of the embedded controllers that motherboards tend to use and run the
fancontrol script to manipulate the pwm channels on that controller.
These days you also have to boot with a kernel argument to allow
loading the driver since ACPI claims those IO ports for its own use
which creates a conflict.
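
For illustration, the usual sequence on such a board looks roughly like
this ( the it87 module is just an example; the right driver and sensor
chip vary by board ):

# kernel command line, to let the native driver claim the IO ports
# that ACPI has reserved for itself:
#   acpi_enforce_resources=lax
sensors-detect   # probe for the board's Super-I/O / embedded controller
modprobe it87    # load whatever driver sensors-detect suggested
pwmconfig        # map pwm outputs to fans and generate /etc/fancontrol
fancontrol       # run the userspace control loop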

Windows users that want to do this have to install a program... I
believe a popular one is called q-fan, that likewise directly accesses
the embedded controller registers to control the fan, since the acpi
tables don't bother properly implementing the acpi fan spec.

Then there are thinkpads, and one or two other laptops ( asus comes to
mind ) that went and implemented their own proprietary acpi interfaces
for fancontrol instead of following the spec, which required some
reverse engineering and yet more drivers to handle these proprietary
acpi interfaces.  You can google for thinkfan if you want to see this.

 These are not hard searches to pull off. These are not obscure 
 references. Go to the google box and start typing ACPI fan...
 and check the autocomplete.
 
 I'll skip over all the parts where you don't know how a chipset
 works and blah, blah, blah...
 
 You really should have just stopped at I don't know and I've
 never because you keep demonstrating that you _don't_ know, and
 that you really _should_ _never_.
 
 Tell us more about the lizard aliens controlling your computer, I
 find your versions of reality fascinating...

By all means, keep embarrassing yourself with nonsense and trying to
cover it up by being rude and insulting.



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Phillip Susi

On 11/20/2014 6:08 PM, Robert White wrote:
 Well you should have _actually_ trimmed your response down to not 
 pressing send.
 
 _Many_ motherboards have complete RAID support at levels 0, 1, 10,
 and 5. A few have RAID6.
 
 Some of them even use the LSI chip-set.

Yes, there are some expensive server class motherboards out there with
integrated real raid chips.  Your average consumer class motherboards
are not those.  They contain intel, nvidia, sil, promise, and via
chipsets that are fake raid.

 Seriously... are you trolling this list with disinformation or
 just repeating tribal knowledge from fifteen year old copies of PC
 Magazine?

Please drop the penis measuring.

 Yea, some of the IDE motherboards that only had RAID1 and RAID0
 (and indeed some of the add-on controllers) back in the IDE-only
 days were really lame just-forked-write devices with no integrity
 checks (hence fake raid) but that's from like the 1990s; it's
 paleolithic age wisdom at this point.

Wrong again... fakeraid became popular with the advent of SATA since
it was easy to add a knob to the bios to switch it between AHCI and
RAID mode, and just change the pci device id.  These chipsets are
still quite common today and several of them do support raid5 and
raid10 ( well, really it's raid 0 + raid1, but that's a whole nother
can of worms ).  Recent intel chips also now have a caching mode for
having an SSD cache a larger HDD.  Intel has also done a lot of work
integrating support for their chipset into mdadm in the last year or
three.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-20 Thread Phillip Susi

On 11/19/2014 5:25 PM, Robert White wrote:
 The controller, the thing that sets the ready bit and sends the 
 interrupt is distinct from the driver, the thing that polls the
 ready bit when the interrupt is sent. At the bus level there are
 fixed delays and retries. Try putting two drives on a pin-select
 IDE bus and strapping them both as _slave_ (or indeed master)
 sometime and watch the shower of fixed delay retries.

No, it does not.  In classical IDE, the controller is really just a
bus bridge.  When you read from the status register in the controller,
the read bus cycle is propagated down the IDE ribbon, and into the
drive, and you are in fact, reading the register directly from the
drive.  That is where the name Integrated Device Electronics came
from: because the controller was really integrated into the drive.

The only fixed delays at the bus level are the bus cycle speed.  There
are no retries.  There are only 3 mentions of the word retry in the
ATA8-APT and they all refer to the host driver.

 That's odd... my bios reads from storage to boot the device and it
 does so using the ACPI storage methods.

No, it doesn't.  It does so by accessing the IDE or AHCI registers
just as pc bios always has.  I suppose I also need to remind you that
we are talking about the context of linux here, and linux does not
make use of the bios for disk access.

 ACPI 4.0 Specification Section 9.8 even disagrees with you at some
 length.
 
 Let's just do the titles shall we:
 
 9.8 ATA Controller Devices 9.8.1 Objects for both ATA and SATA
 Controllers. 9.8.2 IDE Controller Device 9.8.3 Serial ATA (SATA)
 controller Device
 
 Oh, and _lookie_ _here_ in Linux Kernel Menuconfig at Device
 Drivers - * Serial ATA and Parallel ATA drivers (libata) - *
 ACPI firmware driver for PATA
 
 CONFIG_PATA_ACPI:
 
 This option enables an ACPI method driver which drives motherboard
 PATA controller interfaces through the ACPI firmware in the BIOS.
 This driver can sometimes handle otherwise unsupported hardware.
 
 You are a storage _genius_ for knowing that all that stuff doesn't 
 exist... the rest of us must simply muddle along in our
 delusion...

Yes, ACPI 4.0 added this mess.  I have yet to see a single system that
actually implements it.  I can't believe they even bothered adding
this driver to the kernel.  Is there anyone in the world who has ever
used it?  If no motherboard vendor has bothered implementing the ACPI
FAN specs, I very much doubt anyone will ever bother with this.

 Do tell us more... I didn't say the driver would cause long delays,
 I said that the time it takes to error out other improperly
 supported drivers and fall back to this one could induce long
 delays and resets.

There is no error out and fall back.  If the device is in AHCI
mode then it identifies itself as such and the AHCI driver is loaded.
 If it is in IDE mode, then it identifies itself as such, and the IDE
driver is loaded.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-20 Thread Phillip Susi

On 11/19/2014 5:33 PM, Robert White wrote:
 That would be fake raid, not hardware raid.
 
 The LSI MegaRaid controller people would _love_ to hear more about
 your insight into how their battery-backed multi-drive RAID
 controller is fake. You should go work for them. Try the contact
 us link at the bottom of this page. I'm sure they are waiting for
 your insight with bated breath!

Forgive me, I should have trimmed the quote a bit more.  I was
responding specifically to the many mother boards have hardware RAID
support available through the bios part, not the lsi part.

 Odd, my MegaRaid controller takes about fifteen seconds
 by-the-clock to initialize and do the integrity check on my single
 initialized drive.

It is almost certainly spending those 15 seconds on something else,
like bootstrapping its firmware code from a slow serial eeprom or
waiting for you to press the magic key to enter the bios utility.  I
would be very surprised to see that time double if you add a second
disk.  If it does, then they are doing something *very* wrong, and
certainly quite different from any other real or fake raid controller
I've ever used.

 It's amazing that with a fail and retry it would be _faster_...

I have no idea what you are talking about here.  I said that they
aren't going to retry a read that *succeeded* but came back without
their magic signature.  It isn't like reading it again is going to
magically give different results.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Phillip Susi

On 11/18/2014 9:40 PM, Chris Murphy wrote:
 It’s well known on linux-raid@ that consumer drives have well over
 30 second deep recoveries when they lack SCT command support. The
 WDC and Seagate “green” drives are over 2 minutes apparently. This
 isn’t easy to test because it requires a sector with enough error
 that it requires the ECC to do something, and yet not so much error
 that it gives up in less than 30 seconds. So you have to track down
 a drive model spec document (one of those 100 pagers).
 
 This makes sense, sorta, because the manufacturer use case is 
 typically single drive only, and most proscribe raid5/6 with such 
 products. So it’s a “recover data at all costs” behavior because
 it’s assumed to be the only (immediately) available copy.

It doesn't make sense to me.  If it can't recover the data after one
or two hundred retries in one or two seconds, it can keep trying until
the cows come home and it just isn't ever going to work.

 I don’t see how that’s possible because anything other than the
 drive explicitly producing  a read error (which includes the
 affected LBA’s), it’s ambiguous what the actual problem is as far
 as the kernel is concerned. It has no way of knowing which of
 possibly dozens of ata commands queued up in the drive have
 actually hung up the drive. It has no idea why the drive is hung up
 as well.

IIRC, this is true when the drive returns failure as well.  The whole
bio is marked as failed, and the page cache layer then begins retrying
with progressively smaller requests to see if it can get *some* data out.

 No I think 30 is pretty sane for servers using SATA drives because
 if the bus is reset all pending commands in the queue get
 obliterated which is worse than just waiting up to 30 seconds. With
 SAS drives maybe less time makes sense. But in either case you
 still need configurable SCT ERC, or it needs to be a sane fixed
 default like 70 deciseconds.

Who cares if multiple commands in the queue are obliterated if they
can all be retried on the other mirror?  Better to fall back to the
other mirror NOW instead of waiting 30 seconds ( or longer! ).  Sure,
you might end up recovering more than you really had to, but that
won't hurt anything.



Re: BTRFS messes up snapshot LV with origin

2014-11-19 Thread Phillip Susi

On 11/18/2014 9:54 PM, Chris Murphy wrote:
 Why is it silly? Btrfs on a thin volume has practical use case
 aside from just being thinly provisioned, its snapshots are block
 device based, not merely that of an fs tree.

Umm... because one of the big selling points of btrfs is that it is in
a much better position to make snapshots, being aware of the fs tree,
rather than doing it in the block layer.

So it is kind of silly in the first place to be using lvm snapshots
under btrfs, but it is doubly silly to use lvm for snapshots, and
btrfs for the mirroring rather than lvm.  Pick one layer and use it
for both functions.  Even if that is lvm, then it should also be
handling the mirroring.



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Phillip Susi

On 11/18/2014 9:46 PM, Duncan wrote:
 I'm not sure about normal operation, but certainly, many drives
 take longer than 30 seconds to stabilize after power-on, and I
 routinely see resets during this time.

As far as I have seen, typical drive spin up time is on the order of
3-7 seconds.  Hell, I remember my pair of first generation seagate
cheetah 15,000 rpm drives seemed to take *forever* to spin up and that
still was maybe only 15 seconds.  If a drive takes longer than 30
seconds, then there is something wrong with it.  I figure there is a
reason why spin up time is tracked by SMART so it seems like long spin
up time is a sign of a sick drive.
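
If you want to see what your own drive reports, SMART attribute 3 is
the one to look at ( device name hypothetical; the raw units are vendor
specific, but a value trending upward over time is the warning sign ):

smartctl -A /dev/sdX | grep -i spin_up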

 This doesn't happen on single-hardware-device block devices and 
 filesystems because in that case it's either up or down, if the
 device doesn't come up in time the resume simply fails entirely,
 instead of coming up with one or more devices there, but others
 missing as they didn't stabilize in time, as is unfortunately all
 too common in the multi- device scenario.

No, the resume doesn't fail entirely.  The drive is reset, and the
IO request is retried, and by then it should succeed.

 I've seen this with both spinning rust and with SSDs, with mdraid
 and btrfs, with multiple mobos and device controllers, and with
 resume both from suspend to ram (if the machine powers down the
 storage devices in that case, as most modern ones do) and hibernate
 to permanent storage device, over several years worth of kernel
 series, so it's a reasonably widespread phenomena, at least among
 consumer-level SATA devices.  (My experience doesn't extend to
 enterprise-raid-level devices or proper SCSI, etc, so I simply
 don't know, there.)

If you are restoring from hibernation, then the drives are already
spun up before the kernel is loaded.

 While two minutes is getting a bit long, I think it's still within
 normal range, and some devices definitely take over a minute enough
 of the time to be both noticeable and irritating.

It certainly is not normal for a drive to take that long to spin up.
IIRC, the 30 second timeout comes from the ATA specs which state that
it can take up to 30 seconds for a drive to spin up.

 That said, I SHOULD say I'd be far *MORE* irritated if the device
 simply pretended it was stable and started reading/writing data
 before it really had stabilized, particularly with SSDs where that
 sort of behavior has been observed and is known to put some devices
 at risk of complete scrambling of either media or firmware, beyond
 recovery at times.  That of course is the risk of going the other
 direction, and I'd a WHOLE lot rather have devices play it safe for
 another 30 seconds or so after they / think/ they're stable and be
 SURE, than pretend to be just fine when voltages have NOT
 stabilized yet and thus end up scrambling things irrecoverably.
 I've never had that happen here tho I've never stress- tested for
 it, only done normal operation, but I've seen testing reports where
 the testers DID make it happen surprisingly easily, to a surprising
  number of their test devices.

Power supply voltage is stable within milliseconds.  What takes HDDs
time to start up is mechanically bringing the spinning rust up to
speed.  On SSDs, I think you are confusing testing done on power
*cycling* ( i.e. yanking the power cord in the middle of a write )
with startup.

 So, umm... I suspect the 2-minute default is 2 minutes due to
 power-up stabilizing issues, where two minutes is a reasonable
 compromise between failing the boot most of the time if the timeout
 is too low, and taking excessively long for very little further
 gain.

The default is 30 seconds, not 2 minutes.

 sure whether it's even possible, without some specific hardware
 feature available to tell the kernel that it has in fact NOT been
 in power-saving mode for say 5-10 minutes, hopefully long enough
 that voltage readings really /are/ fully stabilized and a shorter
 timeout is possible.

Again, there is no several minute period where voltage stabilizes and
the drive takes longer to access.  This is a complete red herring.




Re: Btrfs on a failing drive

2014-11-19 Thread Phillip Susi

Again, please stop taking this conversation private; keep the mailing
list on the Cc.

On 11/19/2014 11:37 AM, Fennec Fox wrote:
 well I've used spinrite and it's found a few sectors, and they
 never move, so obviously the drive's firmware isn't dealing with bad
 blocks on the drive.  Anyways, I've got a new drive on order, but
 what can I do to prevent the drive from killing any more data?

The drive will only remap bad blocks when you try to write to them, so
if you haven't written to them then it is no surprise that they aren't
going anywhere.

If the drive is actually returning bad data rather than failing the
read outright, then the only thing you can do is to have btrfs
duplicate all data so if the checksum on one copy is bad it can try
the other.
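
And if a SMART self test log or dmesg has given you the LBA of a
genuinely unreadable sector, you can force the remap of just that one
sector instead of rewriting the whole drive.  A sketch with hdparm,
using a made-up LBA; this destroys the contents of that sector:

hdparm --read-sector 123456 /dev/sdX    # confirm it really fails to read
hdparm --write-sector 123456 --yes-i-know-what-i-am-doing /dev/sdX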



Re: BTRFS messes up snapshot LV with origin

2014-11-19 Thread Phillip Susi

On 11/19/2014 1:33 PM, Chris Murphy wrote:
 Thin volumes are more efficient. And the user creating them doesn't
 have to mess around with locating physical devices or possibly
 partitioning them. Plus in enterprise environments with lots of
 storage and many different kinds of use cases, even knowledgeable
 users aren't always granted full access to the physical storage
 anyway. They get a VG to play with, or now they can have a thin
 pool and only consume on storage what is actually used, and not
 what they've reserved. You can mkfs a 4TB virtual size volume, 
 while it only uses 1MB of physical extents on storage. And all of 
 that is orthogonal to using XFS or Btrfs which again comes down to 
 use case. And whether I'd have LVM mirror or Btrfs mirror is again 
 a question of use case, maybe I'm OK with LVM mirroring and I just 
 get the rare corrupt file warning and that's OK. In another use 
 case, corruption isn't OK, I need higher availability of known
 good data therefore I need Btrfs doing the mirroring.

Correct me if I'm wrong, but this kind of setup is basically where you
have a provider running an lvm thin pool volume on their hardware, and
exposing it to the customer's vm as a virtual disk.  In that case,
then the provider can do their snapshots and it won't cause this
problem since the snapshots aren't visible to the vm.  Also in these
cases the provider is normally already providing data protection by
having the vg on a raid6 or raid60 or something, so having the client
vm mirror the data in btrfs is a bit redundant.






Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Phillip Susi

On 11/19/2014 4:05 PM, Robert White wrote:
 It's cheaper, and less error prone, and less likely to generate
 customer returns if the generic controller chips just send init,
 wait a fixed delay, then request a status compared to trying to
 are-you-there-yet poll each device like a nagging child. And you
 are going to see that at every level. And you are going to see it
 multiply with _sparsely_ provisioned buses where the cycle is going
 to be retried for absent LUNs (one disk on a Wide SCSI bus and a
 controller set to probe all LUNs is particularly egregious)

No, they do not wait a fixed time, then proceed.  They do in fact
issue the command, then poll or wait for an interrupt to know when it
is done, then time out and give up if that doesn't happen within a
reasonable amount of time.

 One of the reasons that the whole industry has started favoring 
 point-to-point (SATA, SAS) or physical intercessor chaining 
 point-to-point (eSATA) buses is to remove a lot of those
 wait-and-see delays.

Nope... even with the ancient PIO mode PATA interface, you polled a
ready bit in the status register to see if it was done yet.  If you
always waited 30 seconds for every command your system wouldn't boot
up until next year.

 Another strong actor is selecting the wrong storage controller
 chipset driver. In that case you may be faling back from high-end
 device you think it is, through intermediate chip-set, and back to
 ACPI or BIOS emulation

There is no such thing as ACPI or BIOS emulation.  AHCI SATA
controllers do usually have an old IDE emulation mode instead of AHCI
mode, but this isn't going to cause ridiculously long delays.

 Another common cause is having a dedicated hardware RAID
 controller (dell likes to put LSI MegaRaid controllers in their
 boxes for example), many mother boards have hardware RAID support
 available through the bios, etc, leaving that feature active, then
 the adding a drive and

That would be fake raid, not hardware raid.

 _not_ initializing that drive with the RAID controller disk setup.
 In this case the controller is going to repeatedly probe the drive
 for its proprietary controller signature blocks (and reset the
 drive after each attempt) and then finally fall back to raw block
 pass-through. This can take a long time (thirty seconds to a
 minute).

No, no, and no.  If it reads the drive and does not find its metadata,
it falls back to pass through.  The actual read takes only
milliseconds, though it may have to wait a few seconds for the drive
to spin up.  There is no reason it would keep retrying after a
successful read.

The way you end up with 30-60 second startup time with a raid is if
you have several drives and staggered spinup mode enabled, then each
drive is started one at a time instead of all at once so their
cumulative startup time can add up fairly high.




Re: Btrfs on a failing drive

2014-11-18 Thread Phillip Susi

Please get in the habit of using your mail client's reply-to-all
button instead of reply; there is no need for us to take this
conversation private.

On 11/17/2014 10:15 PM, Fennec Fox wrote:
snip big smartctl output
 I know the drive is dying and needs replacing, but I need to keep
 this drive around for some time longer, as I can't run from a 32
 GB USB; far too slow

If it were just a few bad sectors, then you could deal with that by
writing to them, which would force the drive to reallocate them from
the spare pool.  I'd suggest you dd /dev/zero all over the drive so
everything is written to, then check the smart stats again.  If there
were no write errors, and the smart stats show zero pending sectors,
then everything has been reallocated and you should be ok to reformat
the drive and use it.
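
As a sketch ( device name hypothetical; the dd pass destroys everything
on the drive ):

dd if=/dev/zero of=/dev/sdX bs=1M
smartctl -A /dev/sdX | egrep -i 'pending|reallocat'

Current_Pending_Sector should drop back to zero; any sectors that could
not be rewritten in place show up as an increase in
Reallocated_Sector_Ct.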

As I said before though, the errors you posted from dmesg don't
indicate that the drive failed to read sectors, but rather that it
returned incorrect data, and this is *NEVER* supposed to happen.

I'd suggest running a few passes of badblocks over the drive, testing
writing different patterns and verifying that they read back
correctly.  If it can't do that, then there's nothing for it but to
junk the drive.

badblocks -b 4096 -c 256 -s -t 00 /dev/sda

That will read the drive and verify that it is full of zeros.  If that
passes, write a different pattern to the disk and verify that reads
back correctly:

badblocks -b 4096 -c 256 -s -w /dev/sda




Re: BTRFS messes up snapshot LV with origin

2014-11-18 Thread Phillip Susi

On 11/18/2014 1:16 AM, Chris Murphy wrote:
 If fstab specifies rootfs as UUID, and there are two volumes with
 the same UUID, it’s now ambiguous which one at boot time is the
 intended rootfs. It’s no different than the days of /dev/sdXY where
 X would change designations between boots = ambiguity and why we
 went to UUID.

He already said he has NOT rebooted, so there is no way that the
snapshot has actually been mounted, even if it were UUID confusion.

 So we kinda need a way to distinguish derivative volumes. Maybe
 XFS and ext4 could easily change the volume UUID, but my vague 
 recollection is this is difficult on Btrfs? So that led me to the 
 idea of a way to create an on-the-fly (but consistent) “virtual 
 volume UUID” maybe based on a hash of both the LVM LV and fs
 volume UUID.

When using LVM, you should be referring to the volume by the LVM name
rather than UUID.  LVM names are stable, and don't have the duplicate
uuid problem.
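
For example, in fstab ( vg0/home is a hypothetical VG/LV name ):

/dev/mapper/vg0-home  /home  btrfs  defaults  0  0

Both /dev/mapper/vg0-home and /dev/vg0/home are stable names for the
same LV, and neither changes when a snapshot duplicates the filesystem
UUID.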



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Phillip Susi

On 11/18/2014 7:08 AM, Austin S Hemmelgarn wrote:
 In addition to the storage controller being a possibility as
 mentioned in another reply, there are some parts of the drive that
 aren't covered by SMART attributes on most disks, most notably the
 on-drive cache. There really isn't a way to disable the read cache
 on the drive, but you can disable write-caching, which may improve
 things (and if it's a cheap disk, may provide better reliability
 for BTRFS as well).  The other thing I would suggest trying is a
 different data cable to the drive itself, I've had issues with some
 SATA cables (the cheap red ones you get in the retail packaging for
 some hard disks in particular) having either bad connectors, or bad
 strain-reliefs, and failing after only a few hundred hours of use.

SATA does CRC the data going across it so if it is a bad cable, you
get CRC errors, or oftentimes 8b/10b coding errors, and the transfer is
aborted rather than returning bad data.
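
A bad cable usually shows up in the drive's own CRC counter, which you
can check without touching the hardware ( device name hypothetical ):

smartctl -A /dev/sdX | grep -i crc

A nonzero and growing UDMA_CRC_Error_Count points at the link rather
than the platters.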




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Phillip Susi

On 11/18/2014 10:35 AM, Marc MERLIN wrote:
 Try running hdrecover on your drive, it'll scan all your blocks and
 try to rewrite the ones that are failing, if any: 
 http://hdrecover.sourceforge.net/

He doesn't have blocks that are failing; he has blocks that are being
silently corrupted.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Phillip Susi

On 11/18/2014 11:11 AM, Marc MERLIN wrote:
 That seems to be the case, but hdrecover will rule that part out at
 least.

It's already ruled out: if the read failed that is what the error
message would have said rather than a bad checksum.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Phillip Susi

On 11/18/2014 1:57 PM, Chris Murphy wrote:
 So a.) use smartctl -l scterc to change the value below 30 seconds 
 (300 deciseconds) with 70 deciseconds being reasonable. If the
 drive doesn’t support SCT commands, then b.) change the linux scsi
 command timer to be greater than 120 seconds.
 
 Strictly speaking the command timer would be set to a value that 
 ensures there are no link reset messages in dmesg, that it’s long 
 enough that the drive itself times out and actually reports a read 
 error. This could be much shorter than 120 seconds. I don’t know
 if there are any consumer drives that try longer than 2 minutes to 
 recover data from a marginally bad sector.

Are there really any that take longer than 30 seconds?  That's enough
time for thousands of retries.  If it can't be read after a dozen
tries, it ain't never gonna work.  It seems absurd that a drive would
keep trying for so long.

 Ideally though, don’t use drives that lack SCT support in multiple 
 device volume configurations. An up to 2 minute hang of the
 storage stack isn’t production compatible for most workflows.

Wasn't there an early failure flag that md ( and therefore, btrfs when
doing raid ) sets so the scsi stack doesn't bother with recovery
attempts and just fails the request?  Thus if the drive takes longer
than the scsi_timeout, the failure would be reported to btrfs, which
then can recover using the other copy, write it back to the bad drive,
and hopefully that fixes it?

In that case, you probably want to lower the timeout so that the
recover kicks in sooner instead of hanging your IO stack for 30 seconds.
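
For reference, the knobs discussed above look like this ( device name
hypothetical; 70 deciseconds and 180 seconds are just the commonly
suggested values ):

smartctl -l scterc /dev/sdX                # query the current SCT ERC setting
smartctl -l scterc,70,70 /dev/sdX          # cap read/write recovery at 7s
echo 180 > /sys/block/sdX/device/timeout   # or raise the scsi command timer

Note that SCT ERC settings are volatile on most drives, so they have to
be reapplied on every boot, e.g. from a udev rule or init script.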



Re: Re: What is the vision for btrfs fs repair?

2014-11-17 Thread Phillip Susi

On 10/11/2014 3:29 AM, Goffredo Baroncelli wrote:
 On 10/10/2014 12:53 PM, Bob Marley wrote:
 
 If true, maybe the closest indication we'd get of btrfs
 stablity is the default enabling of autorecovery.
 
 No way! I wouldn't want a default like that.
 
 If you think at distributed transactions: suppose a sync was
 issued on both sides of a distributed transaction, then power was
 lost on one side, than btrfs had corruption. When I remount it,
 definitely the worst thing that can happen is that it
 auto-rolls-back to a previous known-good state.
 
  I cannot agree. I consider a sane default to be a consistent
  state with the recently written data lost, instead of requiring
  user intervention to avoid losing anything.
  
  To address your requirement, we need a super sync command which
  ensures that the data is in the filesystem and not only in the log
  (as sync should ensure).

I have to agree.  There is a reason we have fsck -p and why that is what
is run at boot time.  Some repairs involve a tradeoff that will result
in permanent data loss that maybe could be avoided by going the other
way, or performing manual recovery.  Such repairs should never be done
automatically by default.

For that matter I'm not even sure this sort of thing should be there as
a mount option at all.  It really should require a manual fsck run with
a big warning that *THIS WILL THROW OUT SOME DATA*.
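
In btrfs terms, a minimal sketch of what a manual repair should look
like ( device name hypothetical; --repair can make a damaged fs worse,
which is exactly why it should be an explicit, eyes-open step ):

btrfs check /dev/sdX             # read-only diagnosis first
btrfs check --repair /dev/sdX    # unmounted, and only deliberately

with the roll-back-to-a-known-good-tree case left to the opt-in
recovery mount option rather than any default.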

Now if the data is saved to a snapshot or something so you can manually
try to recover it later rather than being thrown out wholesale, I can
see that being done automatically at boot time.  Of course, if btrfs is
that damaged then wouldn't grub be unable to load your kernel in the
first place?



Re: Btrfs on a failing drive

2014-11-17 Thread Phillip Susi

On 11/17/2014 05:55 PM, Fennec Fox wrote:
 well I am an arch linux user and machine owner using a failing
 drive.  It's still reliable enough for me, but btrfs seems not to mark
 bad blocks as unusable and continues to try to write to them.
 /bbs.archlinux.org/viewtopic.php?pid=1476540#p1476540  this forum
 post has a few more details regarding the problem.  I really need a
 bit of help, thank you

If indeed writes are failing then the drive is only suitable for a
door stop.  Drives remap bad sectors to a spare pool on write so if it
is now failing writes, it has already exhausted its spare pool and you
should have replaced it long ago.  Have a look at its SMART stats and
it will probably confirm the drive is fubar.


 [   83.050733] BTRFS info (device sda1): csum failed ino 3048916
 off 33030144 csum 1217419445 expected csum 510562246 [   83.052317]
 BTRFS info (device sda1): csum failed ino 3048916 off 33030144 csum
 1217419445 expected csum 510562246

That's not saying writes are failing; it is saying that your data has
been silently corrupted, which means the drive is the worst kind of
broken and should be thrown in a fire at once.




Restriper documentation

2013-04-11 Thread Phillip Susi

It looks like the restriper patches got merged last year, but were
never documented in the man page.  Could you do that?
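
For anyone else looking: the merged syntax takes per-chunk-type
filters, along the lines of ( mount point hypothetical ):

btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
btrfs balance start -dusage=5 /mnt

-d filters data chunks and -m metadata chunks; convert= is the
restriper's profile conversion, and usage=5 only rewrites chunks that
are at most 5% full.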



Re: Varying Leafsize and Nodesize in Btrfs

2012-10-11 Thread Phillip Susi

On 8/30/2012 12:25 PM, Josef Bacik wrote:
 Once you cross some hardware-dependent threshold (usually past 32k)
 you start incurring high memmove() overhead in most workloads.
 Like all benchmarking, it's good to test your workload and see what
 works best, but 16k should generally be the best option.  Thanks,
 
 Josef

Why are memmove()s necessary, can they be avoided, and why do they
incur more overhead with 32k+ sizes?





Re: not enough space with data raid0

2012-03-28 Thread Phillip Susi

On 3/17/2012 8:19 AM, Hugo Mills wrote:

Where is the problem, how can I use the full space?


You can't. btrfs requires RAID-0 to be at least two devices wide
(otherwise it's not striped at all, which is the point of RAID-0). If
you want to use the full capacity of both disks and don't care about
the performance gain from striping, use -d single (which is the
default). If you do care about the performance gain from striping,
then you're going to have to lose some usable space.


So currently btrfs's concept of raid0 is to stripe across as many disks as 
possible, with a minimum of 2 disks.  Is there any reason for that 
minimum?  I don't see why it can't allow only one if that's the best it 
can manage.
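
For reference, Hugo's suggested layout spelled out ( device names
hypothetical ):

mkfs.btrfs -d single -m raid1 /dev/sdb /dev/sdc

-d single uses the full capacity of unequal disks, while -m raid1 keeps
the metadata mirrored across both.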




Re: getdents - ext4 vs btrfs performance

2012-03-14 Thread Phillip Susi

On 3/13/2012 5:33 PM, Ted Ts'o wrote:

Are you volunteering to spearhead the design and coding of such a
thing?  Run-time sorting is backwards compatible, and a heck of a lot
easier to code and test...


Do you really think it is that much easier?  Even if it is easier, it is 
still an ugly kludge.  It would be much better to fix the underlying 
problem rather than try to paper over it.



The reality is we'd probably want to implement run-time sorting
*anyway*, for the vast majority of people who don't want to convert to
a new incompatible file system format.  (Even if you can do the
conversion using e2fsck --- which is possible, but it would be even
more code to write --- system administrators tend to be very
conservative about such things, since they might need to boot an older
kernel, or use a rescue CD that doesn't have an uptodate kernel or
file system utilities, etc.)


The same argument could have been made against the current htree 
implementation when it was added.  I think it carries just as little 
weight now as it did then.  People who care about the added performance 
the new feature provides can use it, those who don't, won't.  Eventually 
it will become ubiquitous.



We would still have to implement the case where hash collisions *do*
exist, though, and make sure the right thing happens in that case.
Even if the chance of that happening is 1 in 2**32, with enough
deployed systems (i.e., every Android handset, etc.) it's going to
happen in real life.


Sure it will happen, but if we read one extra block 1 in 4 billion 
times, nobody is going to notice.




Re: getdents - ext4 vs btrfs performance

2012-03-13 Thread Phillip Susi

On 3/9/2012 11:48 PM, Ted Ts'o wrote:

I suspect the best optimization for now is probably something like
this:

1) Since the vast majority of directories are less than (say) 256k
(this would be a tunable value), for directories which are less than
this threshold size, the entire directory is sucked in after the first
readdir() after an opendir() or rewinddir().  The directory contents
are then sorted by inode number (or loaded into an rbtree ordered by
inode number), and returned back to userspace in the inode order via
readdir().  The directory contents will be released on a closedir() or
rewinddir().


Why not just separate the hash table from the conventional, mostly in 
inode order directory entries?  For instance, the first 200k of the 
directory could be the normal entries that would tend to be in inode 
order ( and e2fsck -D would reorder ), and the last 56k of the directory 
would contain the hash table.  Then readdir() just walks the directory 
like normal, and namei() can check the hash table.




Re: getdents - ext4 vs btrfs performance

2012-03-13 Thread Phillip Susi

On 3/13/2012 3:53 PM, Ted Ts'o wrote:

Because that would be a format change.


I think a format change would be preferable to runtime sorting.


What we have today is not a hash table; it's a hashed tree, where we
use a fixed-length key for the tree based on the hash of the file
name.  Currently the leaf nodes of the tree are the directory blocks
themselves; that is, the lowest level of the index blocks tells you to
look at directory block N, where that directory contains the directory
indexes for those file names which are in a particular range (say,
between 0x2325777A and 0x2325801).


So the index nodes contain the hash ranges for the leaf block, but the 
leaf block only contains the regular directory entries, not a hash for 
each name?  That would mean that adding or removing names would require 
moving around the regular directory entries wouldn't it?



If we aren't going to change the ordering of the directory directory,
that means we would need to change things so the leaf nodes contain
the actual directory file names themselves, so that we know whether or
not we've hit the correct entry or not before we go to read in a
specific directory block (otherwise, you'd have problems dealing with
hash collisions).  But in that case, instead of storing the pointer to
the directory entry, since the bulk of the size of a directory entry
is the filename itself, you might as well store the inode number in
the tree itself, and be done with it.


I would think that hash collisions are rare enough that reading a 
directory block you end up not needing once in a blue moon would be 
chalked up under who cares.  So just stick with hash, offset pairs to 
map the hash to the normal directory entry.




Re: getdents - ext4 vs btrfs performance

2012-03-08 Thread Phillip Susi

On 2/29/2012 11:44 PM, Theodore Tso wrote:

You might try sorting the entries returned by readdir by inode number
before you stat them.This is a long-standing weakness in
ext3/ext4, and it has to do with how we added hashed tree indexes to
directories in (a) a backwards compatible way, that (b) was POSIX
compliant with respect to adding and removing directory entries
concurrently with reading all of the directory entries using
readdir.


When I ran into this a while back, I cobbled together this python script 
to measure the correlation from name to inode, inode to first data 
block, and name to first data block for all of the small files in a 
large directory, and found that ext4 gives a very poor correlation due 
to that directory hashing.  This is one of the reasons I prefer using 
dump instead of tar for backups, since it rips through my Maildir more 
than 10 times faster than tar, since it reads the files in inode order.


#!/usr/bin/python

import os
from stat import *
import fcntl
import array

# Measure how well the directory name order correlates with inode
# order, and with the order of each file's first data block (found via
# the FIBMAP ioctl).

names = os.listdir('.')
lastino = 0
name_to_ino_in = 0
name_to_ino_out = 0
lastblock = 0
name_to_block_in = 0
name_to_block_out = 0
iblocks = list()
inode_to_block_in = 0
inode_to_block_out = 0

for file in names:
    try:
        st = os.stat(file)
    except OSError:
        continue
    if not S_ISREG(st.st_mode):
        continue
    if st.st_ino > lastino:
        name_to_ino_in += 1
    else:
        name_to_ino_out += 1
    lastino = st.st_ino
    f = open(file)
    buf = array.array('I', [0])
    err = fcntl.ioctl(f.fileno(), 1, buf)  # FIBMAP: map file block 0 to a disk block
    if err != 0:
        print "ioctl failed on " + file
    block = buf[0]
    if block != 0:
        if block > lastblock:
            name_to_block_in += 1
        else:
            name_to_block_out += 1
        lastblock = block
        iblocks.append((st.st_ino, block))

print "Name to inode correlation: " + str(float(name_to_ino_in) /
    float(name_to_ino_in + name_to_ino_out))
print "Name to block correlation: " + str(float(name_to_block_in) /
    float(name_to_block_in + name_to_block_out))

iblocks.sort()
lastblock = 0
for i in iblocks:
    if i[1] > lastblock:
        inode_to_block_in += 1
    else:
        inode_to_block_out += 1
    lastblock = i[1]
print "Inode to block correlation: " + str(float(inode_to_block_in) /
    float(inode_to_block_in + inode_to_block_out))
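For completeness, Ted's suggestion of sorting by inode number before
the stat pass is easy to sketch with a newer Python, where os.scandir()
exposes d_ino without a stat ( the script above predates it ):

import os

# Sort entries by inode number first, so the stat() calls walk the
# inode table roughly in order instead of in directory-hash order.
entries = sorted(os.scandir('.'), key=lambda e: e.inode())
for entry in entries:
    st = entry.stat(follow_symlinks=False)
    print(entry.name, st.st_size)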


Re: btrfs-raid questions I couldn't find an answer to on the wiki

2012-02-11 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 02/11/2012 12:48 AM, Duncan wrote:
 So you see, a separate /boot really does have its uses. =:^)

True, but booting from removable media is easy too, and a full livecd gives
many more recovery options than the grub shell.  It is a corrupted root
fs that is of much more concern than /boot.

 I like the raid-10 idea and will have to research it some more as I 
 understand the idea behind near and far on raid10, but having never 
 used raid-10, I don't grok that idea, understand it well enough to have 
appreciated the possibility for lose-any-two, before you suggested it.

To grok the other layouts, it helps to think of the simple two disk case.
A far layout is like having a raid0 across the first half of the disk, then
mirroring the whole first half of the disk onto the second half of the other
disk.  Offset has the mirror on the next stripe so each stripe is interleaved
with a mirror stripe, rather than having all original, then all mirrors after.

It looks like mdadm won't let you use both at once, so you'd have to go with
a 3 way far or offset.  Also I was wrong about the additional space.  You
would only get 25% more space since you still have 3 copies of all data so
you get 4/3 times the space, but you will get much better throughput since
it is striped across all 4 disks.  Far gives better sequential read since it
reads just like a raid0, but writes have to seek all the way across the disk
to write the backup.  Offset requires seeks between each stripe on read, but
the writes don't have to seek to write the backup.
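If a picture helps, this little script prints where each chunk's two
copies land in the two layouts for the 2-disk case.  It is only
schematic -- my understanding of the layouts, not mdadm's exact math:

def place(layout, nchunks=8, ndisks=2):
    rows = nchunks                        # chunk-sized slots per disk
    disks = [['.'] * rows for _ in range(ndisks)]
    for c in range(nchunks):
        d, r = c % ndisks, c // ndisks    # disk and row of the first copy
        if layout == 'far':
            disks[d][r] = str(c)                             # raid0 in the first half
            disks[(d + 1) % ndisks][rows // 2 + r] = str(c)  # mirror in the second half
        else:  # offset
            disks[d][2 * r] = str(c)                         # original stripe row
            disks[(d + 1) % ndisks][2 * r + 1] = str(c)      # mirror on the next row
    for i, disk in enumerate(disks):
        print('%-6s disk%d: %s' % (layout, i, ' '.join(disk)))

place('far')     # disk0: 0 2 4 6 1 3 5 7   disk1: 1 3 5 7 0 2 4 6
place('offset')  # disk0: 0 1 2 3 4 5 6 7   disk1: 1 0 3 2 5 4 7 6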

You also could do a raid6 and get the double failure tolerance, and two disks
worth of capacity, but not as much read throughput as raid10.

 But I believe I'll keep multiple raids for much the same reason I keep 
 multiple partitions, it's a FAR more robust solution than having all 
 one's eggs in one RAID basket.

True.

 Besides, I actually did try a single partitioned RAID (well, two, one for 
 all the working copies, one for the backups) when I first setup md/raid, 
 and came to the conclusion that the recovery time on that big a raid is 
 rather longer than I like to be dealing with it.  Multiple raids, with 
 the ones I'm not using ATM offline, means I don't have to worry about 
 recovering the entire thing, only the raids that were online and actually 
 dirty at the time of crash or whatever.  And of course write-intent 
 bitmaps means even shorter recovery time in most cases, so between 
 multiple raids and write-intent-bitmaps, a recovery that would take 2-3 
hours with my original all-in-one raid setup, now often takes < 5 
 minutes! =:^)  Even with write-intent-bitmaps, I'd hate to go back to big 
 all-in-one raids, for recovery reasons alone, and between that and the 
 additional robustness of multiple raids, I just don't see myself doing 
 that any time soon.

Depends on what you mean by recovery.  Re-adding a drive that you removed
will be faster with multiple raids ( though write-intent bitmaps also take
care of that ), but if you actually have a failed disk and have to replace
it with a new one, you still have to do a rebuild on all of the raids
so it ends up taking the same total time.

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJPNwIZAAoJEJrBOlT6nu754yUIAL79DHhanAC0SWaXFBYTT4T2
N2xG3ved177BXX0VhKCcoYcWFiSerWzAnPlZsUDzMfaHDxBNF4ATsnboY31rCG1j
QJE3Oz9Cop45xhTBrMcwYs+woR+0HAmYb1Qa1aKrNwG0d6XlfZsLFBFUtrB411lX
erOS77EsT2BYaumanvouM8vm5LG9ZrOItiELI7rm+hEcw64p3rjkUkvBG5nTdj8K
0x7tYgUHEZNngMSx4rMTUFTlx9485gn7eJ2hT1gbVNmRcCGwotTpOTXoJMh3csbF
jYbUJKqK0n+gxhHSW/+KJBTlb1gbZpuaiibqpQnUlOecI/Fmj2MpHQnZ4WSNpc8=
=HjvY
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Packed small files

2012-02-10 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 1/31/2012 11:46 AM, Hugo Mills wrote:
 So you're looking at a minimum of 413 bytes of metadata overhead 
 for an inline file, plus the length of the filename.
 
 Also note that the file is stored in the metadata, so by default 
 it's stored with DUP or RAID-1 replication (even if data is set to
 be single). This means that you'll actually use up twice this
 amount of space on the disks, unless you create the FS with
 metadata set to single.
 
 I don't know how these figures compare with other filesystems. My 
 entirely uneducated guess is that they're probably comparable,
 with the exception of the DUP effect.

On ext4 you are looking at 256 bytes for the inode, name length + a
few bytes for the directory entry, another few bytes for the hashed
directory entry, and a whole 4k block to hold the data, so ~4300 bytes
( + name length ) of overhead to store a 64 byte file.
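Back of the envelope, ignoring name length ( the 413-byte figure is
yours, doubled for DUP; my ext4 dirent numbers are rough guesses ):

print(2 * 413)             # btrfs inline file with DUP metadata: ~826 bytes
print(256 + 8 + 8 + 4096)  # ext4: inode + dirent + htree entry + data block = 4368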

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJPNWnBAAoJEJrBOlT6nu75lVkIAIO2mjYeVK5BbMNfw5HJ7jZO
WfIBv5xR8V06e0VLgv4FQqPlWcm+ZQHorYDM7h15q4cIgoZ3x0P3n3bSCurFRLfF
lSRjn/fsX1Y9isPEB6/monPm+08U6qh7jXGldEMOLKaA7VG/QOVR01k3W2a3FkJ4
kWBjEbK/xE013WaQnfR26PydRT8ILRzGUE4uEKGsdV39JkcEorQ1lDg+XWz5Hvy7
VmelT21272PssIUbRub1QkZXj6p0SUu1zeU1IwOdt6X1uXFcWqFbBFRGJk4f2+ZM
5MquuVC+YrzfDIBnS0ZBZ4UqNmxYuPCSzTLlpPiJJiY/AwR7916H/CoF5k38k/M=
=8YmX
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-raid questions I couldn't find an answer to on the wiki

2012-02-10 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 1/31/2012 12:55 AM, Duncan wrote:
 Thanks!  I'm on grub2 as well.  It is still masked on gentoo, but
 I recently unmasked and upgraded to it, taking advantage of the
 fact that I have two two-spindle md/raid-1s for /boot and its
 backup to test and upgrade one of them first, then the other only
 when I was satisfied with the results on the first set.  I'll be
 using a similar strategy for the btrfs upgrades, only most of my
 md/raid-1s are 4-spindle, with two sets, working and backup, and
 I'll upgrade one set first.

Why do you want to have a separate /boot partition?  Unless you can't
boot without it, having one just makes things more
complex/problematic.  If you do have one, I agree that it is best to
keep it ext4 not btrfs.

 Meanwhile, you're right about subvolumes.  I'd not try them on a
 btrfs /boot, either.  (I don't really see the use case for it, for
 a separate /boot, tho there's certainly a case for a /boot
 subvolume on a btrfs root, for people doing that.)

The Ubuntu installer creates two subvolumes by default when you
install on btrfs: one named @, mounted on /, and one named @home,
mounted on /home.  Grub2 handles this well since the subvols have
names in the default root, so grub just refers to /@/boot instead of
/boot, and so on.  The apt-btrfs-snapshot package makes apt
automatically snapshot the root subvol so you can revert after an
upgrade.  This seamlessly causes grub to go back to the old boot menu
without the new kernels too, since it goes back to reading the old
grub.cfg in the reverted root subvol.

I have a radically different suggestion you might consider when
rebuilding your system.  Partition each disk into only two partitions: one
for bios_grub, and one for everything else ( or just use MBR and skip
the bios_grub partition ).  Give the second partitions to mdadm to
make a raid10 array out of.  If you use a 2x far and 2x offset instead
of the default near layout, you will have an array that can still
handle any 2 of the 4 drives failing, will have twice the capacity of
a 4 way mirror, almost the same sequential read throughput of a 4 way
raid0, and about twice the write throughput of a 4 way mirror.
Partition that array up and put your filesystems on it.

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJPNXPnAAoJEJrBOlT6nu75/d8IAJ0fQ3xWPe6SYBY8nj34mcWh
ql6C4ieMkd07ZCuymT5ZVhWJhtdc6/Vg7ecWmhYdeu4d1WGp4DvTumEYHVl4ZlRk
mT9Lq4SupDL5Dk0nfxZUqY8XnIek3kIG/wgekgdSuLF0J9QFQdCFc25j/idIh0Dy
Gk5NJtgKmsTKUQhzPQZxif8nwWVQzQICm5P//FeOQgx8sq7iVdCQHUxlJEPfsL7m
CVVMJPVk+524rFTWxLZ4KLbXkNE7nrikg7UMlWBtM5gflkU0Y+bfmZKPGcqBCSSn
AId5M5alzjLSLblBqwf8wKpEIiDXBqb6f+bSxqnk5FdKKx5l5lziZyqQM+gnyIo=
=ePD3
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mkfs: Handle creation of filesystem larger than the first device

2012-02-08 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 1/26/2012 11:03 AM, Jan Kara wrote:
 make_btrfs() function takes a size of filesystem as an argument. It
 uses this value to set the size of the first device as well which
 is wrong for filesystems larger than this device. It results in
 'attempt to access beyond end of device' messages from the kernel.
 So add size of the first device as an argument to make_btrfs().

I don't think this patch is correct.  Yes, the size switch only
applies to the first device, so it doesn't make any sense to try to
use a value larger than that device.  Attempting to do so probably
should be trapped and it should error out, and the man page should
probably clarify the fact that the size is only for the first device.

It looks like you think the size should somehow apply to multiple
devices or to the total fs size when creating on multiple devices, and
that just doesn't make sense.

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJPMvCrAAoJEJrBOlT6nu75A/IH/3Pn9MFhxXI1kTu1jriA/1ZA
IaCPkZbNFvS1DC5U8E75Ys4Qtn/SkwVOdGG8VCObfJKhhbWXEGKLdtllxV8WUkRM
QYN3rFeb3gLxb9UIcyyRC+RDtJtVzVXClFN7WYgA2QXmCyYdnV3axzvO/tkvADuq
Is28sKnYzV9poKTlghqFmEGmqcnTtfFKg9MC60wGDKEOMuAeijImGaAEp773G7+S
JSOOPcuDj/Lh7ZO+duR2ul+zUI9DWr2IbZM6zUxOoN2fZEJAwJLNPsU7rBDX8w+g
FVHFHrRv6wVGq0I7Dvb2flif5O0wSRA5yhK/7GanaVEMBKSV9A0c5qOE9LakL/s=
=X7QW
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How long does it take to balance a 2x1TB RAID1 ?

2012-01-09 Thread Phillip Susi

On 1/6/2012 6:23 AM, Dirk Lutzebäck wrote:

Hi,

I have setup up a btrfs RAID1 using two 1TB drives. How long should a
'btrfs filesystem balance' take? It is running now for more than 3 days
on about 30% CPU and 40% wait state.

I am using stock btrfs from ubuntu 11.10 kernel 3.0.0


Not nearly that long.  Assuming it actually has to rewrite 1 TB of data 
and is only getting 50 MB/s, that should only take about 5.5 hours.  You 
might want to try a newer kernel ( like the one from 12.04 ).
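Spelled out:

print(1e12 / 50e6)         # 1 TB at 50 MB/s = 20000 seconds
print(1e12 / 50e6 / 3600)  # ~5.5 hours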

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: change resize ioctl to take device path instead of id

2012-01-09 Thread Phillip Susi

Bump.

On 12/11/2011 10:12 PM, Phillip Susi wrote:

The resize ioctl took an optional argument that was a string
representation of the devid which you wish to resize.  For
the sake of consistency with the other ioctls that take a
device argument, I converted this to take a device path instead
of a devid number, and look up the number from the path.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs-progs: document --rootdir mkfs switch

2012-01-09 Thread Phillip Susi

Signed-off-by: Phillip Susi ps...@cfl.rr.com
---
 man/mkfs.btrfs.8.in |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/man/mkfs.btrfs.8.in b/man/mkfs.btrfs.8.in
index 542e6cf..25e817b 100644
--- a/man/mkfs.btrfs.8.in
+++ b/man/mkfs.btrfs.8.in
@@ -12,6 +12,7 @@ mkfs.btrfs \- create an btrfs filesystem
 [ \fB\-M\fP\fI mixed data+metadata\fP ]
 [ \fB\-n\fP\fI nodesize\fP ]
 [ \fB\-s\fP\fI sectorsize\fP ]
+[ \fB\-r\fP\fI rootdir\fP ]
 [ \fB\-h\fP ]
 [ \fB\-V\fP ]
 \fI device\fP [ \fIdevice ...\fP ]
@@ -59,6 +60,9 @@ Specify the nodesize. By default the value is set to the 
pagesize.
 \fB\-s\fR, \fB\-\-sectorsize \fIsize\fR
 Specify the sectorsize, the minimum block allocation.
 .TP
+\fB\-r\fR, \fB\-\-rootdir \fIrootdir\fR
+Specify a directory to copy into the newly created fs.
+.TP
 \fB\-V\fR, \fB\-\-version\fR
 Print the \fBmkfs.btrfs\fP version and exit.
 .SH AVAILABILITY
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs-progs: removed extraneous whitespace from mkfs man page

2012-01-09 Thread Phillip Susi
There were extra spaces around some of the arguments in the man
page for mkfs.

Signed-off-by: Phillip Susi ps...@cfl.rr.com
---
 man/mkfs.btrfs.8.in |   19 ++-
 1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/man/mkfs.btrfs.8.in b/man/mkfs.btrfs.8.in
index 432db1b..542e6cf 100644
--- a/man/mkfs.btrfs.8.in
+++ b/man/mkfs.btrfs.8.in
@@ -5,15 +5,16 @@ mkfs.btrfs \- create an btrfs filesystem
 .B mkfs.btrfs
 [ \fB\-A\fP\fI alloc-start\fP ]
 [ \fB\-b\fP\fI byte-count\fP ]
-[ \fB \-d\fP\fI data-profile\fP ]
-[ \fB \-l\fP\fI leafsize\fP ]
-[ \fB \-L\fP\fI label\fP ]
-[ \fB \-m\fP\fI metadata profile\fP ]
-[ \fB \-M\fP\fI mixed data+metadata\fP ]
-[ \fB \-n\fP\fI nodesize\fP ]
-[ \fB \-s\fP\fI sectorsize\fP ]
-[ \fB \-h\fP ]
-[ \fB \-V\fP ] \fI device\fP [ \fI device ...\fP ]
+[ \fB\-d\fP\fI data-profile\fP ]
+[ \fB\-l\fP\fI leafsize\fP ]
+[ \fB\-L\fP\fI label\fP ]
+[ \fB\-m\fP\fI metadata profile\fP ]
+[ \fB\-M\fP\fI mixed data+metadata\fP ]
+[ \fB\-n\fP\fI nodesize\fP ]
+[ \fB\-s\fP\fI sectorsize\fP ]
+[ \fB\-h\fP ]
+[ \fB\-V\fP ]
+\fI device\fP [ \fIdevice ...\fP ]
 .SH DESCRIPTION
 .B mkfs.btrfs
 is used to create an btrfs filesystem (usually in a disk partition, or an array
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: don't panic if orphan item already exists

2011-12-14 Thread Phillip Susi

On 12/14/2011 9:58 AM, Josef Bacik wrote:

There is no underlying bug, there is a shitty situation, the shitty situation


Maybe my assumptions are wrong somewhere then.  You add the orphan item 
to make sure that the truncate will be finalized even if the system 
crashes before the transaction commits right?  So if truncate() fails 
with -ENOSPC, then you shouldn't be trying to finalize the truncate on 
the next mount, should you ( because the call did not succeed )?


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: don't panic if orphan item already exists

2011-12-14 Thread Phillip Susi

On 12/14/2011 10:27 AM, Josef Bacik wrote:

Except consider the case that the program was written intelligently and checks
for errors on truncate.  So he writes 100G, truncates to 50M, and the truncate
fails and he closes the file and exits.  Then somewhere down the road the inode
is evicted from cache and we reboot the box.  Next time the box comes up it only
looks like a 50M file, except we're still taking up 100G of disk space, and we
have no idea there's space there and it's still taken up in the allocator so it
will just look like we've lost ~100G of space.  This is why it's left there, so
everything can be cleaned up.


I'm a little confused here.  Is there a commit somewhere in there?  How 
can the 100g allocation be committed, but not the i_size of the inode? 
Shouldn't either both or neither be committed?  If both are committed, 
and then the truncate fails, then I would expect the system to come back 
up after a crash with the file still at 100g.  That is, as long as the 
orphan item is not left in place after the failed truncate.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: don't panic if orphan item already exists

2011-12-14 Thread Phillip Susi

On 12/14/2011 10:46 AM, Josef Bacik wrote:

file looks like its only 50m but still has 100g of extents taking up space
orphan cleanup happens and the inode is truncated and the extra space is cleaned
up


Yes, but isn't the only reason that the i_size change actually hit the 
disk is because of the orphan item?  So with no orphan item, the i_size 
remains at 100g.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: don't panic if orphan item already exists

2011-12-13 Thread Phillip Susi

On 12/13/2011 12:55 PM, Josef Bacik wrote:

I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
a loop.  This is because we will add an orphan item, do the truncate, the
truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
left with an orphan item still in the fs.  Then we come back later to do another
truncate and it blows up because we already have an orphan item.  This is ok so
just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,


Wouldn't it be better to fix the underlying bug, and remove the orphan 
item when the truncate fails?


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Chunk allocation size

2011-12-12 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

While poking around with btrfs-gui I noticed my fs had a fair number of quite 
small chunks ( especially metadata ), so I started looking into how they are 
allocated.  It appears that the current rule is to allocate:

1)  At most, 10% of the total fs capacity
2)  For metadata, at most 256 mb
3)  For data, at most 10gb, or 1gb per disk, whichever is lower

Why these values?  Why have hard coded sizes at all, instead of just saying, 
for instance, 4% of total capacity for metadata and 8% of total capacity for 
data chunks?  In my case, I had two 36 gb disks, so this led to data chunks 
maxing out at 2gb ( or ~2.8% ), and metadata chunks maxing out at 256mb ( or 
~0.36% ).  It seems to me that ideal chunk sizes should be in the vicinity of 
4%-10% of total capacity ( giving a total of 10-21 chunks ), and never less 
than 1% ( which would give more than 100 chunks ).
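For concreteness, here is my reading of the current caps applied to my
case ( illustrative only ):

def max_chunk(fs_bytes, ndisks, metadata):
    cap = fs_bytes // 10                         # rule 1: at most 10% of fs capacity
    if metadata:
        return min(cap, 256 * 2**20)             # rule 2: metadata at most 256 mb
    return min(cap, 10 * 2**30, ndisks * 2**30)  # rule 3: data at most 10gb or 1gb/disk

fs = 2 * 36 * 2**30                              # two 36 gb disks
print(max_chunk(fs, 2, True) / fs)               # ~0.0035, the ~0.36% above
print(max_chunk(fs, 2, False) / fs)              # ~0.0278, the ~2.8% above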
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7m1yUACgkQJ4UciIs+XuJ99wCfdhSvFB6S1uz+qTWBJotFoZ0d
6FwAoJuerIp9brqfv1E2PJfRsV7VDEbr
=FpRK
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: change resize ioctl to take device path instead of id

2011-12-11 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 12/11/2011 10:31 PM, Li Zefan wrote:
 Phillip Susi wrote:
 The resize ioctl took an optional argument that was a string
 representation of the devid which you wish to resize.  For
 the sake of consistency with the other ioctls that take a
 device argument, I converted this to take a device path instead
 of a devid number, and look up the number from the path.

 
 but.. isn't this an ABI change?

Technically no, since the ABI is just a string that may (though this was 
undocumented) have a colon in it followed by digits.

 so instead of changing it, I think it's ok to extend it.

I considered that at first, but the existing code appears not to handle 
errors ( what happens when the string can't be converted to an integer? ), 
and the interface has not been documented until now, so I figured I may as 
well just get rid of it.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7leawACgkQJ4UciIs+XuIJ3ACeKuQidLKrb/nVqaS13Z1yzzoh
MDAAoIIPhBEnAbmTWdc6M4NBQUdX1+Pv
=7JV1
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cloning a Btrfs partition

2011-12-08 Thread Phillip Susi

On 12/7/2011 1:49 PM, BJ Quinn wrote:

What I need isn't really an equivalent zfs send -- my script can do
that. As I remember, zfs send was pretty slow too in a scenario like
this. What I need is to be able to clone a btrfs array somehow -- dd
would be nice, but as I said I end up with the identical UUID
problem. Is there a way to change the UUID of an array?


No, btrfs send is exactly what you need.  Using dd is slow because it 
copies unused blocks, and requires the source fs be unmounted and the 
destination be an empty partition.  rsync is slow because it can't take 
advantage of the btrfs tree to quickly locate the files (or parts of 
them) that have changed.  A btrfs send would solve all of these issues.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Don't prevent removal of devices that break raid reqs

2011-12-05 Thread Phillip Susi

On 11/10/2011 2:32 PM, Alexandre Oliva wrote:

Instead of preventing the removal of devices that would render existing
raid10 or raid1 impossible, warn but go ahead with it; the rebalancing
code is smart enough to use different block group types.

Should the refusal remain, so that we'd only proceed with a
newly-introduced --force option or so?


I just thought of something.  When adding the second device, balance 
converts DUP to RAID1 automatically, and it is the RAID1 that prevents 
removing the second disk.  What if the chunks were left with both the 
DUP and RAID1 flags set?  That way, if you explicitly requested raid1, 
it won't let you accidentally drop below two disks, but if it was 
auto-promoted from DUP, then going back to DUP is ok.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Resize command syntax wrong?

2011-12-01 Thread Phillip Susi

On 12/1/2011 1:46 AM, Helmut Hullen wrote:

balance != resize


I know.

p.e.
Start with 1 disk with 2 GB and 1 disk with 4 GByte
Fill it with 2 Gbyte data, each disk gets 1 GByte.

Add a disk with 10 GByte, run balance: each disk gets about 700 MByte.

That has nothing to do with resize.


Right, so why are you talking about balance when this thread is about resize?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Resize command syntax wrong?

2011-11-30 Thread Phillip Susi

Currently the resize command is under filesystem, and takes a path to the 
mounted filesystem.  This seems wrong to me.  Shouldn't it be under device, and 
take a path to a device to resize?  Otherwise, how can a resize operation 
make any sense when you have multiple devices?
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs/git question.

2011-11-28 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/28/2011 12:53 PM, Ken D'Ambrosio wrote:
 Seems I've picked up a wireless regression, and randomly drop my WiFi
 connection with more recent kernels.  While I'd love to try to track down the
 issue, the sporadic nature makes it difficult.  But I don't want to revert to 
 a
 flat-out old kernel because of all the btrfs modifications.  Is it possible
 using git to add *just* btrfs patches to an older kernel?

Sure: use git rebase to apply the patches to the older kernel.


-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7UPDAACgkQJ4UciIs+XuLauQCgi9iTXZGD5BVTyTQJoc3Mm1R4
Oi8AmwX0oqwdF4e3dOTtAoUgFeYbKnOt
=i3i2
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: wrong / too less space on raid

2011-11-27 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/27/2011 08:36 PM, Miao Xie wrote:
 This number is just an appraised number, not rigorous. It tells us
 how much space we can use to make up raid0 block groups which are
 used to store the file data.

More specifically, the available space that df reports is the amount of space 
available to store file data, which means it doesn't count the space reserved 
for metadata block groups, even though a not insignificant amount of that space 
may be free for metadata allocations.

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7TB4AACgkQJ4UciIs+XuLdOACaA3hrErj82plyPOxXEoM+kHd3
5XkAnR3vR+0bhGlewjKTSEsVNZgc0mka
=UNVc
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[btrfs-progs PATCH 1/2] Removed extraneous whitespace from mkfs man page

2011-11-25 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

There were extra spaces around some of the arguments in the man
page for mkfs.

Signed-off-by: Phillip Susi ps...@cfl.rr.com
- ---
 man/mkfs.btrfs.8.in |   19 ++-
 1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/man/mkfs.btrfs.8.in b/man/mkfs.btrfs.8.in
index 432db1b..542e6cf 100644
- --- a/man/mkfs.btrfs.8.in
+++ b/man/mkfs.btrfs.8.in
@@ -5,15 +5,16 @@ mkfs.btrfs \- create an btrfs filesystem
 .B mkfs.btrfs
 [ \fB\-A\fP\fI alloc-start\fP ]
 [ \fB\-b\fP\fI byte-count\fP ]
- -[ \fB \-d\fP\fI data-profile\fP ]
- -[ \fB \-l\fP\fI leafsize\fP ]
- -[ \fB \-L\fP\fI label\fP ]
- -[ \fB \-m\fP\fI metadata profile\fP ]
- -[ \fB \-M\fP\fI mixed data+metadata\fP ]
- -[ \fB \-n\fP\fI nodesize\fP ]
- -[ \fB \-s\fP\fI sectorsize\fP ]
- -[ \fB \-h\fP ]
- -[ \fB \-V\fP ] \fI device\fP [ \fI device ...\fP ]
+[ \fB\-d\fP\fI data-profile\fP ]
+[ \fB\-l\fP\fI leafsize\fP ]
+[ \fB\-L\fP\fI label\fP ]
+[ \fB\-m\fP\fI metadata profile\fP ]
+[ \fB\-M\fP\fI mixed data+metadata\fP ]
+[ \fB\-n\fP\fI nodesize\fP ]
+[ \fB\-s\fP\fI sectorsize\fP ]
+[ \fB\-h\fP ]
+[ \fB\-V\fP ]
+\fI device\fP [ \fIdevice ...\fP ]
 .SH DESCRIPTION
 .B mkfs.btrfs
 is used to create an btrfs filesystem (usually in a disk partition, or an array
- -- 
1.7.5.4

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7P0hUACgkQJ4UciIs+XuK6lgCcDVJyqpPS1tkZrFHO7gSwanZY
P+4An231BTNvoFgIIf52NOs1t/SjFTt5
=tSkT
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] Document --rootdir mkfs switch

2011-11-25 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

The --rootdir switch was not documented in the man page

Signed-off-by: Phillip Susi ps...@cfl.rr.com
- ---
 man/mkfs.btrfs.8.in |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/man/mkfs.btrfs.8.in b/man/mkfs.btrfs.8.in
index 542e6cf..25e817b 100644
- --- a/man/mkfs.btrfs.8.in
+++ b/man/mkfs.btrfs.8.in
@@ -12,6 +12,7 @@ mkfs.btrfs \- create an btrfs filesystem
 [ \fB\-M\fP\fI mixed data+metadata\fP ]
 [ \fB\-n\fP\fI nodesize\fP ]
 [ \fB\-s\fP\fI sectorsize\fP ]
+[ \fB\-r\fP\fI rootdir\fP ]
 [ \fB\-h\fP ]
 [ \fB\-V\fP ]
 \fI device\fP [ \fIdevice ...\fP ]
@@ -59,6 +60,9 @@ Specify the nodesize. By default the value is set to the 
pagesize.
 \fB\-s\fR, \fB\-\-sectorsize \fIsize\fR
 Specify the sectorsize, the minimum block allocation.
 .TP
+\fB\-r\fR, \fB\-\-rootdir \fIrootdir\fR
+Specify a directory to copy into the newly created fs.
+.TP
 \fB\-V\fR, \fB\-\-version\fR
 Print the \fBmkfs.btrfs\fP version and exit.
 .SH AVAILABILITY
- -- 
1.7.5.4

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7P0lcACgkQJ4UciIs+XuKpMQCffecm37bfSbb8vqJe5hzmeuvU
n08AoJvHiTcO9k8M0g9k4TC7iGgaLZlN
=yE4p
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Subvolume Quota on-disk structures and configuration

2011-11-25 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 07/10/2011 04:21 AM, Arne Jansen wrote:
 Now that I've got a working prototype of subvolume quota, I'd like
 to get some feedback on the on-disk structure and the commands I
 intend to use.

I think I've noticed a bug so far, and have one comment on the qgroup show 
command.  The command seems to show the current usage of the qgroup, but I 
can't see how to view the limits ( if any ).  It seems like the show command 
should show both.

The bug I seem to have noticed is that rm fails with EQUOTA.  I set a 1g limit 
on a new subvol, and ran dd if=/dev/zero of=/mnt/foo, which created a file 
approx 1g in size before erroring out with EQUOTA.  After that, I did an 
echo bar > /mnt/bar, and to my surprise, this did not fail with EQUOTA.  Now 
when I try to rm /mnt/bar or /mnt/foo, THAT fails with EQUOTA.  I also got 
this in
dmesg:

[  992.078275] WARNING: at fs/btrfs/inode.c:6670 
btrfs_destroy_inode+0x31d/0x360 [btrfs]()
[  992.078276] Hardware name: System Product Name
[  992.078277] Modules linked in: nls_utf8 isofs bnep rfcomm kvm_intel kvm 
parport_pc ppdev dm_crypt binfmt_misc nls_iso8859_1 nls_cp437 vfat fat 
snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep 
snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event joydev snd_seq psmouse 
eeepc_wmi asus_wmi snd_timer snd_seq_device btusb bluetooth serio_raw snd 
sparse_keymap soundcore mei(C) snd_page_alloc w83627ehf hwmon_vid coretemp lp 
parport raid10 raid456 async_pq async_xor async_memcpy async_raid6_recov 
raid6_pq async_tx raid1 raid0 multipath linear dm_raid45 xor dm_mirror 
dm_region_hash dm_log btrfs zlib_deflate libcrc32c hid_microsoft usbhid hid 
mxm_wmi wmi radeon ahci libahci ttm drm_kms_helper e1000e xhci_hcd drm 
i2c_algo_bit zram(C)
[  992.078305] Pid: 2342, comm: rm Tainted: G C   3.2.0-rc2+ #7
[  992.078306] Call Trace:
[  992.078311]  [81062acf] warn_slowpath_common+0x7f/0xc0
[  992.078313]  [81062b2a] warn_slowpath_null+0x1a/0x20
[  992.078320]  [a020de9d] btrfs_destroy_inode+0x31d/0x360 [btrfs]
[  992.078324]  [811895cc] destroy_inode+0x3c/0x70
[  992.078326]  [8118972a] evict+0x12a/0x1c0
[  992.078328]  [811898c9] iput+0x109/0x220
[  992.078331]  [8117e7b3] do_unlinkat+0x153/0x1d0
[  992.078333]  [811740ea] ? sys_newfstatat+0x2a/0x40
[  992.078334]  [8117f352] sys_unlinkat+0x22/0x40
[  992.078337]  [8160d0c2] system_call_fastpath+0x16/0x1b
[  992.078338] ---[ end trace 770bc93001697fbc ]---

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7QZ6AACgkQJ4UciIs+XuL8PwCfQH+oKQ+AJNu5+mKXPdT4byX2
6BcAoKrAgii/ljRs/0lbk4AExbolurXA
=1lN4
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs and load (sys)

2011-11-23 Thread Phillip Susi

On 11/23/2011 9:43 AM, krz...@gmail.com wrote:

What all those btrfs benchmarks do not tell you is that performance
decreases (sys load increases) as the btree grows. Creating a
btrfs filesystem is instantaneous because the initial tree is just
nothing...


While something is clearly wrong, this isn't it.  Each snapshot is its 
own btree, and you said there is little churn each day, so they aren't 
getting significantly larger over time.  Each snapshot is tracked in the 
tree of tree roots, so technically it is growing each time you take a 
snapshot, but 60 items is nothing.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Removed extraneous whitespace from mkfs man page

2011-11-23 Thread Phillip Susi

There were extra spaces around some of the arguments in the man
page for mkfs.
---
 man/mkfs.btrfs.8.in |   19 ++-
 1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/man/mkfs.btrfs.8.in b/man/mkfs.btrfs.8.in
index 432db1b..542e6cf 100644
--- a/man/mkfs.btrfs.8.in
+++ b/man/mkfs.btrfs.8.in
@@ -5,15 +5,16 @@ mkfs.btrfs \- create an btrfs filesystem
 .B mkfs.btrfs
 [ \fB\-A\fP\fI alloc-start\fP ]
 [ \fB\-b\fP\fI byte-count\fP ]
-[ \fB \-d\fP\fI data-profile\fP ]
-[ \fB \-l\fP\fI leafsize\fP ]
-[ \fB \-L\fP\fI label\fP ]
-[ \fB \-m\fP\fI metadata profile\fP ]
-[ \fB \-M\fP\fI mixed data+metadata\fP ]
-[ \fB \-n\fP\fI nodesize\fP ]
-[ \fB \-s\fP\fI sectorsize\fP ]
-[ \fB \-h\fP ]
-[ \fB \-V\fP ] \fI device\fP [ \fI device ...\fP ]
+[ \fB\-d\fP\fI data-profile\fP ]
+[ \fB\-l\fP\fI leafsize\fP ]
+[ \fB\-L\fP\fI label\fP ]
+[ \fB\-m\fP\fI metadata profile\fP ]
+[ \fB\-M\fP\fI mixed data+metadata\fP ]
+[ \fB\-n\fP\fI nodesize\fP ]
+[ \fB\-s\fP\fI sectorsize\fP ]
+[ \fB\-h\fP ]
+[ \fB\-V\fP ]
+\fI device\fP [ \fIdevice ...\fP ]
 .SH DESCRIPTION
 .B mkfs.btrfs
 is used to create an btrfs filesystem (usually in a disk partition, or an array
--
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] Document --rootdir mkfs switch

2011-11-23 Thread Phillip Susi

---
 man/mkfs.btrfs.8.in |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/man/mkfs.btrfs.8.in b/man/mkfs.btrfs.8.in
index 542e6cf..25e817b 100644
--- a/man/mkfs.btrfs.8.in
+++ b/man/mkfs.btrfs.8.in
@@ -12,6 +12,7 @@ mkfs.btrfs \- create an btrfs filesystem
 [ \fB\-M\fP\fI mixed data+metadata\fP ]
 [ \fB\-n\fP\fI nodesize\fP ]
 [ \fB\-s\fP\fI sectorsize\fP ]
+[ \fB\-r\fP\fI rootdir\fP ]
 [ \fB\-h\fP ]
 [ \fB\-V\fP ]
 \fI device\fP [ \fIdevice ...\fP ]
@@ -59,6 +60,9 @@ Specify the nodesize. By default the value is set to the 
pagesize.
 \fB\-s\fR, \fB\-\-sectorsize \fIsize\fR
 Specify the sectorsize, the minimum block allocation.
 .TP
+\fB\-r\fR, \fB\-\-rootdir \fIrootdir\fR
+Specify a directory to copy into the newly created fs.
+.TP
 \fB\-V\fR, \fB\-\-version\fR
 Print the \fBmkfs.btrfs\fP version and exit.
 .SH AVAILABILITY
--
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Subvolume Quota on-disk structures and configuration

2011-11-22 Thread Phillip Susi

On 11/21/2011 3:15 PM, Arne Jansen wrote:

I can rebase it to the current for-linus and push it out later today.



git://git.kernel.org/pub/scm/linux/kernel/git/arne/linux-btrfs.git qgroups

just waiting for the replication to the mirrors...


What about the btrfs-progs changes to add the commands?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] Show Chunks by position

2011-11-22 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/18/2011 10:50 AM, Phillip Susi wrote:
 This is a nice little tool.  The one suggestion that I have is that
 it display the actual chunks and where they are located.  It seems
 that right now it uses the same ioctl that btrfs fi df uses, and that
 just gets the total combined size for all chunks of a given type.  It
 would be nice if the gui actually parsed the chunk tree and showed
 each individual chunk with lines showing how they are mapped to the
 physical disks.

I managed to cobble together the following patches to implement this.
The one thing that I still don't like is that the radio knob is in the
data replication and allocation box, instead of the volumes box.  My
python and tkinter skills are too lacking to figure out how to move it
down there.


Phillip Susi (3):
  Changed volume_df() to return all chunks with their offsets
  Update UsageDisplay to be capable of displaying all chunks by
position
  Add radio knob to show space by position or combined

 btrfsgui/gui/usagedisplay.py |   98 --
 btrfsgui/hlp/size.py |   18 +++
 2 files changed, 73 insertions(+), 43 deletions(-)

- -- 
1.7.5.4

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7MSIoACgkQJ4UciIs+XuLyjQCeI4m7+u75R863B2RY3hkFELbP
iJ8AoJwzVdiYZqgE1tXvHEOHz+gciDgj
=dfd1
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Changed volume_df() to return all chunks with their offsets

2011-11-22 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

volume_df() used to return a structure containing a dictionary named
usage that contained 3 chunks, keyed by their usage type ( sys, meta,
data ).  I changed this to an array named chunks that contains one entry
for every chunk found on the disk.  Each chunk still is a dictionary that
contains flags, size, and used, but now also contains voffset and poffset
entries giving their starting offset relative to the filesystem and disk
respectively.  The poffset is intended to be used to show the chunk at
the correct position in the disk graph, and the voffset is intended to
allow correlation with the filesystem graph.
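So one entry in the new chunks array looks something like this
( made-up numbers; field names as described above ):

{
    "flags":   0x1,           # block group type bits
    "size":    1073741824,    # chunk length in bytes
    "used":    536870912,     # bytes used within the chunk
    "voffset": 12884901888,   # start offset within the filesystem
    "poffset": 1048576,       # start offset on this disk
}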
- ---
 btrfsgui/gui/usagedisplay.py |   14 ++
 btrfsgui/hlp/size.py |   18 --
 2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/btrfsgui/gui/usagedisplay.py b/btrfsgui/gui/usagedisplay.py
index ccfc148..e6ae9b4 100644
- --- a/btrfsgui/gui/usagedisplay.py
+++ b/btrfsgui/gui/usagedisplay.py
@@ -347,13 +347,19 @@ class UsageDisplay(Frame, Requester):
 		max_space = 0
 		for dev in self.fs["vols"]:
 			rv, text, obj = self.request("vol_df", self.fs["uuid"], dev["id"])
- -			dev["usage"] = obj
+			dev["vol_df"] = obj
 			if obj["size"] > max_space:
 				max_space = obj["size"]
 
 		y = 4
 		for i, dev in enumerate(self.fs["vols"]):
- -			obj = dev["usage"]
+			# Combine chunks with same flags
+			chunks = {}
+			for chunk in dev["vol_df"]["chunks"]:
+				if chunk["flags"] in chunks:
+					chunks[chunk["flags"]]["size"] += chunk["size"]
+					chunks[chunk["flags"]]["used"] += chunk["used"]
+				else: chunks[chunk["flags"]] = chunk
 			frame = LabelFrame(self.per_disk,
 					   text=dev["path"])
 			frame.grid(sticky=N+S+E+W)
@@ -371,8 +377,8 @@ class UsageDisplay(Frame, Requester):
 			bbox = self.per_disk.bbox(container)
 			y = bbox[3] + 4
 			raw_free += self.create_usage_box(canvas,
- -							  obj["usage"].values(),
- -							  size=obj["size"],
+							  chunks.values(),
+							  size=dev["vol_df"]["size"],
 							  max_size=max_space)
 		self.per_disk.configure(
 			scrollregion=(0, 0,
diff --git a/btrfsgui/hlp/size.py b/btrfsgui/hlp/size.py
index 0ec98c3..5ac89f4 100644
- --- a/btrfsgui/hlp/size.py
+++ b/btrfsgui/hlp/size.py
@@ -69,7 +69,7 @@ def volume_df(params):
 	res["size"] = data[1]
 	res["used"] = data[2]
 	res["uuid"] = btrfs.format_uuid(data[12])
- -	res["usage"] = {}
+	res["chunks"] = []
 
 	# Now, collect data on the block group types in use
 	last_offset = 0
@@ -124,20 +124,18 @@ def volume_df(params):
 			if header[2] != chunk_length:
 				raise HelperException("Chunk length inconsistent: chunk tree says {0} bytes, extent tree says {1} bytes".format(chunk_length, header[2]))
 			chunk_used = extent_info[0]
- -
- -			if chunk_type not in res["usage"]:
- -				res["usage"][chunk_type] = {
- -					"flags": chunk_type,
- -					"size": 0,
- -					"used": 0,
- -				}
- -			res["usage"][chunk_type]["size"] += ext_length
 			# We have a total of chunk_used space used, out of
 			# chunk_length in this block group. So
 			# chunk_used/chunk_length is the proportion of the BG
 			# used. We multiply that by the length of the dev_extent
 			# to get the amount of space used in the dev_extent.
- -			res["usage"][chunk_type]["used"] += chunk_used * ext_length / chunk_length
+			res["chunks"].append({
+

[PATCH 2/3] Update UsageDisplay to be capable of displaying all chunks by position

2011-11-22 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Previously the input_data to create_usage_box was assumed to be a list of
3 chunks, one of each type: data, meta, sys.  Now the list can contain any
number of entries and they will each be displayed.  If the entries contain
the key offset, then they will be shown at the appropriate location based
on that offset, with any gaps filled in by unused space ( they are thus
assumed to be in order ).  Without the offset key, they will be displayed
in order, with no gaps.
- ---
 btrfsgui/gui/usagedisplay.py |   65 ++---
 1 files changed, 35 insertions(+), 30 deletions(-)

diff --git a/btrfsgui/gui/usagedisplay.py b/btrfsgui/gui/usagedisplay.py
index e6ae9b4..aff24da 100644
- --- a/btrfsgui/gui/usagedisplay.py
+++ b/btrfsgui/gui/usagedisplay.py
@@ -243,52 +243,48 @@ class UsageDisplay(Frame, Requester):
 		'size' or 'free' should be provided. If 'size' is set, the
 		amount of free space computed is returned; otherwise the
 		return value is arbitrary.
- -
 		# Calculate the overall width of the box we are going to draw
 		width = DF_BOX_WIDTH
 		if max_size is not None:
 			width = width * size / max_size
- -
+		box = SplitBox(orient=SplitBox.HORIZONTAL)
+		nextpos = 0
 		# Categorise the data
- -		data = SplitBox(orient=SplitBox.VERTICAL)
- -		meta = SplitBox(orient=SplitBox.VERTICAL)
- -		sys = SplitBox(orient=SplitBox.VERTICAL)
- -		freebox = SplitBox(orient=SplitBox.VERTICAL)
- -		for bg_type in input_data:
- -			repl = btrfs.replication_type(bg_type["flags"])
- -			usage = btrfs.usage_type(bg_type["flags"])
+		for chunk in input_data:
+			if not "offset" in chunk:
+				chunk["offset"] = nextpos
+			if nextpos <= chunk["offset"]:
+				freesize = chunk["offset"] - nextpos
+				freebox = SplitBox(orient=SplitBox.VERTICAL)
+				freebox.append((freesize, { "fill": COLOUR_UNUSED }))
+			nextpos = chunk["offset"] + chunk["size"]
+			if size is not None:
+				size -= chunk["size"]
+			chunkbox = SplitBox(orient=SplitBox.VERTICAL)
+
+			repl = btrfs.replication_type(chunk["flags"])
+			usage = btrfs.usage_type(chunk["flags"])
 
 			if usage == "data":
- -				destination = data
 				col = COLOURS[repl][0]
 			if usage == "meta":
- -				destination = meta
 				col = COLOURS[repl][1]
 			if usage == "sys":
- -				destination = sys
 				col = COLOURS[repl][2]
 
 			usedfree = SplitBox(orient=SplitBox.HORIZONTAL)
- -			usedfree.append((bg_type["used"],
+			usedfree.append((chunk["used"],
 					 { "fill": col }))
- -			usedfree.append((bg_type["size"]-bg_type["used"],
+			usedfree.append((chunk["size"]-chunk["used"],
 					 { "fill": col, "stripe": fade(col) }))
- -			destination.append((usedfree.total, usedfree))
- -			if size is not None:
- -				size -= bg_type["size"]
+			chunkbox.append((usedfree.total, usedfree))
+			box.append((chunkbox.total, chunkbox))
 
- -		if size is not None:
- -			freebox.append((size, { "fill": COLOUR_UNUSED }))
- -		elif free is not None:
+		if size is not None and nextpos < size:
+			free = size - nextpos
+		if free is not None:
+			freebox = SplitBox(orient=SplitBox.VERTICAL)
 			freebox.append((free, { "fill": COLOUR_UNUSED }))
- -
- -		# total is our whole block
- -		# *_total are the three main divisions
- -		box = SplitBox(orient=SplitBox.HORIZONTAL)
- -		box.append((sys.total, sys))
- -		box.append((meta.total, meta))
- -		box.append((data.total, data))
- -		if size is not None or free is not None:
 			box.append((freebox.total, freebox))
 
 		box.set_position(DF_BOX_PADDING, DF_BOX_PADDING,
@@ -329,6 +325,15 @@ class UsageDisplay(Frame, Requester):
 
 	@ex_handler
 	def update_display(self):
+		def usage_sort(chunk):
+			usage = btrfs.usage_type(chunk["flags"])
+

[PATCH 3/3] Add radio knob to show space by position or combined

2011-11-22 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

The previous method was to show chunks combined by type.  This knob
allows the user to choose to show each individual chunk in its actual
position.
- ---
 btrfsgui/gui/usagedisplay.py |   35 ---
 1 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/btrfsgui/gui/usagedisplay.py b/btrfsgui/gui/usagedisplay.py
index aff24da..fd8b1e2 100644
- --- a/btrfsgui/gui/usagedisplay.py
+++ b/btrfsgui/gui/usagedisplay.py
@@ -213,6 +213,22 @@ class UsageDisplay(Frame, Requester):
 		but.grid(row=2, column=1, sticky=W)
 		self.df_selection.set("alloc")
 
+		self.df_byposition = StringVar()
+		Label(box, text="Show allocated space").grid(row=3, column=0)
+		but = Radiobutton(
+			box, text="Combined",
+			variable=self.df_byposition,
+			command=self.change_display,
+			value="combined")
+		but.grid(row=3, column=1, sticky=W)
+		but = Radiobutton(
+			box, text="Individual",
+			variable=self.df_byposition,
+			command=self.change_display,
+			value="individual")
+		but.grid(row=4, column=1, sticky=W)
+		self.df_byposition.set("combined")
+
 		frm = LabelFrame(self, text="Volumes")
 		frm.columnconfigure(0, weight=1)
 		frm.rowconfigure(0, weight=1)
@@ -359,12 +375,17 @@ class UsageDisplay(Frame, Requester):
 		y = 4
 		for i, dev in enumerate(self.fs["vols"]):
 			# Combine chunks with same flags
- -			chunks = {}
- -			for chunk in dev["vol_df"]["chunks"]:
- -				if chunk["flags"] in chunks:
- -					chunks[chunk["flags"]]["size"] += chunk["size"]
- -					chunks[chunk["flags"]]["used"] += chunk["used"]
- -				else: chunks[chunk["flags"]] = chunk
+			if self.df_byposition.get() == "combined":
+				chunks = {}
+				for chunk in dev["vol_df"]["chunks"]:
+					if chunk["flags"] in chunks:
+						chunks[chunk["flags"]]["size"] += chunk["size"]
+						chunks[chunk["flags"]]["used"] += chunk["used"]
+					else: chunks[chunk["flags"]] = chunk
+				chunks = sorted(chunks.values(), key=usage_sort)
+			else:
+				chunks = [{"size": x["size"], "used": x["used"], "flags": x["flags"],
+					   "offset": x["poffset"]} for x in dev["vol_df"]["chunks"]]
 			frame = LabelFrame(self.per_disk,
 					   text=dev["path"])
 			frame.grid(sticky=N+S+E+W)
@@ -382,7 +403,7 @@ class UsageDisplay(Frame, Requester):
 			bbox = self.per_disk.bbox(container)
 			y = bbox[3] + 4
 			raw_free += self.create_usage_box(canvas,
- -							  sorted(chunks.values(), key=usage_sort),
+							  chunks,
 							  size=dev["vol_df"]["size"],
 							  max_size=max_space)
 		self.per_disk.configure(
- -- 
1.7.5.4

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7MSKIACgkQJ4UciIs+XuIx5QCfSHc4/8bPkQuiTGs0R3D6SyPK
6+cAn1n7HcLowmobT48hQc+iPjUqJEer
=Ljl6
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Subvolume Quota on-disk structures and configuration

2011-11-21 Thread Phillip Susi

On 7/10/2011 4:21 AM, Arne Jansen wrote:

btrfs qgroup limit [--exclusive] <size>|none <qgroupid> <path>

This sets actual limits on a qgroup. If --exclusive is given, the
exclusive usage is limited instead of the referenced. I don't know
if there are use cases where both values need a (possibly different)
limit. <path> means the path to the root. Instead of <qgroupid>
<path>, a path to a subvolume can be given instead.

btrfs qgroup create <qgroupid> <path>
btrfs qgroup destroy <qgroupid> <path>
btrfs qgroup assign <childid> <parentid> <path>
btrfs qgroup remove <childid> <parentid> <path>

These 4 commands are used to build hierarchical qgroups and are only
for advanced users. I'll explain more of the concepts in a later
paper.

The main point here is that in the simplest case, a user creates a
filesystem with initial quota support, creates his /var /usr /home
etc. subvolumes and limits them with commands like

btrfs qgroup limit 10g /usr

That should be simple enough for the common use case.


Wouldn't that make the syntax above actually be:

btrfs qgroup limit [--exclusive] <size|none> [<qgroupid>] <path>

Since the qgroupid is optional?  And the meaning of path depends on 
whether or not qgroupid is specified.  With qgroupid, path is anywhere 
on the fs, but without it, it specifies the path of the implicit 
qgroupid, right?


I also have a question about the interactions with groups of groups. 
Say I have 4 subvolumes: 1, 2, 3, and Z.  I group the first 3 volumes 
and set a limit on them.  Now if all 3 volumes share a chunk of space, 
that space should only count towards the group once, rather than 3 
times.  You might think the solution to that is to use the exclusive 
limits, but that would mean that any space volume 3 and volume Z share 
would not be counted in the group at all.  I don't care about volume Z 
since it is not part of the group, yet it can influence the used space 
of the group.  Likewise, if I set an exclusive limit on the group, and 
then create snapshot Y from subvol 2, that would significantly reduce 
the exclusive charge for the group, and we don't want that.
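To make the question concrete, here is how I am picturing the
accounting -- a toy model of my understanding, not the patchset's code:

# Each extent is tagged with the set of subvolumes referencing it.
extents = [
    (10, {1, 2, 3}),   # shared only among group members
    (5,  {3, 'Z'}),    # shared with Z, outside the group
    (7,  {2}),         # exclusive to one member
]
group = {1, 2, 3}

# referenced: count an extent once if any member references it
referenced = sum(size for size, refs in extents if refs & group)
# exclusive: count it only if nothing outside the group references it
exclusive = sum(size for size, refs in extents if refs <= group)
print(referenced, exclusive)   # -> 22, 17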

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Subvolume Quota on-disk structures and configuration

2011-11-21 Thread Phillip Susi

On 11/21/2011 12:20 PM, Arne Jansen wrote:

On 11/21/2011 05:06 PM, Phillip Susi wrote:

On 7/10/2011 4:21 AM, Arne Jansen wrote:

btrfs qgroup limit [--exclusive] <size>|none <qgroupid> <path>


btrfs qgroup limit 10g /usr

That should be simple enough for the common use case.


Wouldn't that make the syntax above actually be:

btrfs qgroup limit [--exclusive] <size|none> [<qgroupid>] <path>


You don't mean to actually changing the syntax, but adding a better
explanation or a more precise usage?


What I mean is that your syntax listed groupid in angle brackets, 
indicating that it is a required argument, but your description seems to 
indicate that it is optional, so it should be in square brackets.  Also 
the size bit I assume was meant to be a required parameter that should 
be either a number or the word none, so the angle brackets should 
enclose the |none part too.



I also have a question about the interactions with groups of groups. Say
I have 4 subvolumes: 1, 2, 3, and Z.  I group the first 3 volumes and
set a limit on them.  Now if all 3 volumes share a chunk of space, that
space should only count towards the group once, rather than 3 times.


It's just what groups are made for. In your scenario the chunk of space
would count only once. Some hopefully better explanation can be found at


Ohh, so the group is a union of the chunks in the members, not a sum? 
So if you set an exclusive limit on the group, that would count 
everything shared between 1, 2, 3 once, and not count any shared with Z? 
 In other words, --exclusive excludes space shared with things outside 
the group, not within it?



http://sensille.com/qgroups.pdf

Have you already played with the patchset?


Not yet; I just found it today from the new thread on the subject, and 
look forward to playing with it tonight.  I was wondering what revision 
the patches are based on, and are they in a public git repo?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Announcing btrfs-gui

2011-11-18 Thread Phillip Susi

On 6/1/2011 7:20 PM, Hugo Mills wrote:

Over the last few weeks, I've been playing with a foolish idea,
mostly triggered by a cluster of people being confused by btrfs's free
space reporting (df vs btrfs fi df vs btrfs fi show). I also wanted an
excuse, and some code, to mess around in the depths of the FS data
structures.

Like all silly ideas, this one got a bit out of hand, and seems to
have turned into something vaguely useful. I'm therefore pleased to
announce the first major public release of btrfs-gui[1]: a point-and-
click tool for managing btrfs filesystems.


This is a nice little tool.  The one suggestion that I have is that it 
display the actual chunks and where they are located.  It seems that 
right now it uses the same ioctl that btrfs fi df uses, and that just 
gets the total combined size for all chunks of a given type.  It would 
be nice if the gui actually parsed the chunk tree and showed each 
individual chunk with lines showing how they are mapped to the physical 
disks.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] improve space utilization on off-sized raid devices

2011-11-17 Thread Phillip Susi

On 11/17/2011 7:59 AM, Arne Jansen wrote:

Right you are. So you want to sacrifice stripe size for space efficiency.
Why don't you just use RAID1?
Instead of reducing the stripe size for the majority of writes, I'd prefer
to allow RAID10 to go down to 2 disks. This should also solve it.


Yes, it appears that btrfs's current idea of raid10 is actually raid0+1, 
not raid10.  If it were proper raid10, it could use the remaining space 
on the 3 larger disks for a raid10 metadata chunk.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/21] [RFC] Btrfs: restriper

2011-11-16 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 08/23/2011 04:01 PM, Ilya Dryomov wrote:
 Hello,
 
 This patch series adds an initial implementation of restriper (it's
 a clever name for a relocation framework that allows selective
 profile changing and selective balancing, with some goodies like
 pausing/resuming and reporting progress to the user).
 
 Profile changing is global (per-FS) so far, per-subvolume profiles 
 require some discussion and can be implemented in future.  This is
 a RFC so some features/problems are not yet implemented/resolved.
 The current TODO list is as follows:

I managed to use these patches to convert the raid1 system and
metadata chunks back to single and drop the second disk from a two
disk array.  In doing so I noticed that the restriper required a force
switch to downgrade raid1 to single.  This seems completely
unnecessary to me.  A force switch to btrfs device delete might make
sense since delete may or may not force a downgrade, but with
restripe, the request to convert from raid1 to single is already quite
explicit with no room for ambiguity, so there should be no need for an
additional confirmation switch.

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7Ee+oACgkQJ4UciIs+XuIGIQCdFx9cP7cPQPslE9IcFNDg/6Ns
LQYAn2l2ykGwiJt/yZNvuqePyMj3sxYH
=P+HR
-END PGP SIGNATURE-


Re: [PATCH 00/21] [RFC] Btrfs: restriper

2011-11-15 Thread Phillip Susi

On 11/15/2011 4:22 AM, Ilya Dryomov wrote:

Restriper won't let you do a raid1 -> dup transition because dup is only
allowed on a single-spindle FS, so you'll end up with the error btrfs:
unable to start restripe.

There is no way to prioritize disks during restripe.  To get dup back
you'll have to convert everything to single, remove the second drive and
then convert metadata from single to dup.


So there is no way to put a disk into read-only mode and prevent
allocation of new chunks there?


It seems like both of these limitations are highly undesirable when 
trying to recover from a failing disk.  You don't want any more data 
being written to the failing disk while you are trying to remove it, and 
you certainly don't want to drop back to a single copy of data that is 
then written to the failing disk.



Re: Btrfs progs git repo on kernel.org

2011-11-15 Thread Phillip Susi

On 10/27/2011 11:27 AM, Chris Mason wrote:

Hi everyone,

I've pulled in Hugo's integration tree, minus the features that were not
yet in the kernel.  This also has a few small commits that I had queued
up outside of the fsck work.

Hugo, many thanks for keeping up the integration tree!  Taking out the
features not in the kernel meant I had to rebase the commits; I'm
sorry about that.

The code from the integration tree is here:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git


I notice that there are no tags in the repo.  Did you just forget to
push them, or have they been lost?  Also, the repository description
still needs to be filled out.




Bird's eye view of relocation trees

2011-11-14 Thread Phillip Susi
Can someone give a bird's eye view of what relocation trees are and how 
they are used?  I've been looking over the code all morning and can only 
see that it appears to be a normal root tree, but with a different 
objectid for some reason.



Re: [PATCH 00/21] [RFC] Btrfs: restriper

2011-11-14 Thread Phillip Susi

I have a fs that started with the default policy of metadata=dup.  I
added a second device and rebalanced, so the metadata chunks were
converted to raid1.  Now I cannot remove the second device, because
raid1 requires at least two devices.

If I understand this patch series correctly, I can use it to manually
convert those raid1 chunks back to dup, and then remove the second
device.  It occurs to me, though, that during the restripe the newly
created dup chunks can still be allocated from either disk, and any
that land on the second disk will then need to be relocated in order
to remove that disk.  This seems inefficient, so I was wondering: is
there a way to make sure that during the restripe only the disk I
intend to keep is used to allocate the dup chunks, avoiding the need
to relocate them when I remove the second disk?



Re: grub-1.99 and btrfs-only

2011-11-07 Thread Phillip Susi

On 11/5/2011 10:02 PM, Chuck Burns wrote:

These are my current subvolumes:
# btrfs sub list /
ID 256 top level 5 path mainroot
ID 257 top level 5 path home

I have sub 256 set as default, and then home is mounted onto mainroot.


I advise against using set-default at all.  The setup Ubuntu seems to be
going for (and which is working well for me so far) creates two
subvolumes under the default root, named @ and @home, intended to be
mounted at / and /home respectively.  The /@ subvolume is then mounted
as the root via the rootflags=subvol=@ argument, and grub is configured
to use /@/boot/grub as its prefix directory.
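
For anyone reproducing that layout by hand, the moving parts look
roughly like this (the UUID, kernel version, and exact option spelling
are hypothetical):

in /etc/fstab:
UUID=abcd-ef01 /     btrfs defaults,subvol=@      0 1
UUID=abcd-ef01 /home btrfs defaults,subvol=@home  0 2

and on the kernel command line in grub.cfg:
linux /@/boot/vmlinuz-3.0.0 root=UUID=abcd-ef01 ro rootflags=subvol=@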


I'm still getting a sparse file not allowed error during boot, though,
and have to press Enter to continue.  I still haven't tracked that one
down.




Re: understanding the tree-log

2011-11-04 Thread Phillip Susi

On 11/4/2011 1:09 AM, Liu Bo wrote:

Btrfs has an expensive transaction commit; if we committed a transaction
every time we fsync, the performance would not be that good.  Instead,
we introduce a write-ahead log to make our fsync faster.

So if you fsync your data, it means your data is safely in the log tree;
then if a crash takes place, the data can be recovered from the log.


How can you write to the log tree without a full commit?  The tree of
tree roots has to point to the root node of the log tree, so when you
write out the log tree, the tree of tree roots needs to be updated too,
which requires a full commit, doesn't it?




btrfs-tools source code

2011-10-26 Thread Phillip Susi
It still doesn't appear to have returned to kernel.org.  Should that 
happen sometime soon, or is it available somewhere else now?



Where's the superblock allocation?

2011-10-26 Thread Phillip Susi

After a fresh mkfs.btrfs, I'm trying to understand the data structures,
and I'm a little confused about what keeps the boot sector from being
allocated to a file.

According to the device tree, the first 4 MiB of the disk are mapped
directly to the first 4 MiB of the chunk space:

item 0 key (1 DEV_EXTENT 0) itemoff 3947 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 0 length 4194304

And the chunk tree seems to agree:

item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 0) itemoff 3817 itemsize 80
chunk length 4194304 owner 2 type 2 num_stripes 1
stripe 0 devid 1 offset 0

But the only entry I find in the extent tree for offset 0 is:

item 0 key (0 BLOCK_GROUP_ITEM 4194304) itemoff 3971 itemsize 24
block group used 0 chunk_objectid 256 flags 2

So it appears that the first 4 MiB of the disk are part of a block
group, and up for allocation whenever needed.  Why isn't there an entry
in the extent tree marking the first few KiB as reserved for the
superblock, or alternatively, why doesn't the chunk map start at a
non-zero disk offset?



Re: Snapshot rollback

2011-10-25 Thread Phillip Susi

On 10/24/2011 10:04 PM, Arand Nash wrote:
 Btrfs is unfortunately unable to look for snapshots by name above the
 currently set default root (I do not know exactly why); it can, however,
 find them by id anywhere.

Ok, so looking up subvols by name uses the default subvol to resolve the
name, and so when I change the default subvol to the snapshot of @,
there is no @home name there pointing to the subvol?  Things make much
more sense knowing that.  I thought that subvolumes had their own
namespace outside of any one subvolume.

Is there a way to create another name entry in @snap that points to
@home, or can you only have the original @home entry in the default
subvol?

 To backup
 ~# mount /dev/sda1 /mnt
 ~# ls /mnt
 @ @home
 ~# btrfs sub snap /mnt/@ /mnt/@rootsnap
 ~# ls /mnt
 @ @home @rootsnap
 
 And to rollback:
 ~# mv /mnt/@ /mnt/@rootmessy
 ~# mv /mnt/@rootsnap /mnt/@
 And just reboot, since it just mounts whatever is named @/.

Perfect... I think I'll keep the default subvol mounted under /.subvols
to make managing them easy.


Re: Snapshot rollback

2011-10-24 Thread Phillip Susi

On 10/24/2011 01:45 AM, dima wrote:
 Hello Phillip,
 It is hard to judge without seeing your fstab and bootloader config.
 Maybe your / was directly in subvolid=0 without creating a separate
 subvolume for it (like __active in Goffredo's reply)?  In my very
 humble opinion, if you have your @home subvolume under subvolid=0 and
 then change the default subvolume, it just cannot access your @home
 any more.

Why can't it?

It appears that Ubuntu sets up two subvols, one named @ and one named
@home, and mounts them at / and /home respectively.  The boot loader was
set to pass rootflags=subvol=@.  After changing the default volume, the
system would not boot until I removed that rootflags argument; then it
mounted the snapshot correctly as the root, but refused to mount /home,
giving the nonsense error that /dev/sda1 is not a valid block device.

 Here is a very good article that explains the working of subvolumes. I used it
 as reference a lot.
 http://www.funtoo.org/wiki/BTRFS_Fun#Using_snapshots_for_system_recovery_.28aka_Back_to_the_Future.29

This advice seems completely goofy.  It tells you to change the default
subvol and boot from the snapshot, but then to have rsync copy all of
the files back to the default volume and switch back to using that.
This seems to defeat the entire purpose.  If you are already booting
from the snapshot, why would you want to waste time copying the files
back to the original subvol instead of just deleting it and using the
snapshot volume from now on?



Subvolume level allocation policy

2011-10-23 Thread Phillip Susi

Is it possible (yet?) to manipulate the allocation policy at the
subvolume level instead of the fs level?  For example, to make / use
raid1 and /home use raid0?  Or to have / allocated from an SSD and
/home allocated from a giant 2 TB HD?


Re: Atomic file data replace API

2011-01-09 Thread Phillip Susi

On 01/09/2011 01:56 PM, Thomas Bellman wrote:

That particular problem was solved with the introduction of the
rename(2) system call in 4.2BSD a bit more than a quarter of a
century ago. There is no need to introduce another, less flexible,
API for doing the same thing.


I'm curious whether there are any BSD specifications stating that
rename() has this behavior.  Ted Ts'o has been claiming that POSIX does
not require this behavior in the face of a crash, and that as a result
an application that relies on such behavior is broken and needs to
fsync() before rename().  This, of course, makes replacing numerous
files much slower; glacially so on btrfs.  There has been a great deal
of discussion on the dpkg mailing lists about it, since plenty of
people are upset that dpkg runs much slower these days than it used to,
because it now calls fsync() before rename() in order to avoid breakage
on ext4.
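
For concreteness, the pattern applications are being told to adopt
looks like this; a minimal sketch with hypothetical file names, no
fsync of the containing directory, and only basic error handling:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* Replace 'path' with new contents: write a temp file, fsync() it,
 * then rename() over the target.  The fsync() before rename() is the
 * step Ted says POSIX requires of applications, and the one that made
 * dpkg glacial on btrfs. */
static int replace_file(const char *path, const char *tmp,
                        const void *data, size_t len)
{
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                unlink(tmp);
                return -1;
        }
        close(fd);
        /* rename() atomically replaces the name: after a crash you see
         * either the old contents or the new, never a truncated mix. */
        return rename(tmp, path);
}

int main(void)
{
        static const char msg[] = "new contents\n";

        return replace_file("config", "config.tmp", msg,
                            sizeof(msg) - 1) ? 1 : 0;
}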


You can read more, including the rationale for why POSIX does not
require this behavior, at http://lwn.net/Articles/323607/.


I still say that preserving the order of the writes and the rename is
the only sane thing to do, whether POSIX requires it or not.



Re: Atomic file data replace API

2011-01-07 Thread Phillip Susi

On 01/07/2011 09:58 AM, Chris Mason wrote:

Yes and no.  We have a best-effort mechanism where we try to guess that
since you've done this truncate and the write, you want the writes to
show up quickly.  But it's a guess.


It is a pretty good guess, and one that the NT kernel has been making
for 15 years or so.  I've been following this issue for some time and I
still don't understand why Ted is so hostile to this and can't make it
work right on ext4.  When you get a rename(), you just need to check
whether there are outstanding journal transactions and/or dirty cache
pages, and hang the rename() transaction on the end of those.  That
way, if the system crashes after the new file has fully hit the disk,
the old file is gone and you only have the new one; but if it crashes
before, you still have the old one in place.


Both the writes and the rename can be delayed in the cache to an 
arbitrary point in the future; what matters is that their order is 
preserved.



Re: What to do about subvolumes?

2010-12-02 Thread Phillip Susi

On 12/02/2010 04:49 AM, Arne Jansen wrote:

What about the alternative of allocating inode numbers globally?  The
only problem would be with snapshots, as they share the inum with the
source, but one could just remap inode numbers in snapshots by sparing
some bits at the top of this 64-bit field.


I was wondering this as well.  Why give each subvol its own inode number
space?  If they each have their own inode space, then to avoid breaking
the assumptions of various programs, they must each have a unique
st_dev.  How are inode numbers currently allocated, and why wouldn't it
be simple to just have a single pool of inode numbers for all subvols?
It seems obvious to me that snapshots start out inheriting the inode
numbers of the original subvol, but they must be given a new st_dev.
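
The assumption at stake is easy to demonstrate: tools like tar, find,
and du treat the (st_dev, st_ino) pair as a file's unique identity, so
if subvols reuse inode numbers they had better differ in st_dev.  A
trivial checker (pass it, say, a file and its counterpart inside a
snapshot):

#include <stdio.h>
#include <sys/stat.h>

/* Print the (st_dev, st_ino) identity of two paths and report whether
 * programs using that pair as a unique key would confuse them. */
int main(int argc, char **argv)
{
        struct stat a, b;

        if (argc != 3 || stat(argv[1], &a) != 0 || stat(argv[2], &b) != 0) {
                fprintf(stderr, "usage: %s <path1> <path2>\n", argv[0]);
                return 1;
        }
        printf("%s: dev %llu ino %llu\n", argv[1],
               (unsigned long long)a.st_dev, (unsigned long long)a.st_ino);
        printf("%s: dev %llu ino %llu\n", argv[2],
               (unsigned long long)b.st_dev, (unsigned long long)b.st_ino);
        if (a.st_dev == b.st_dev && a.st_ino == b.st_ino)
                printf("same identity: these would be treated as one file\n");
        return 0;
}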



Re: Chunk map/control

2010-11-10 Thread Phillip Susi
On 9/22/2010 4:12 PM, Phillip Susi wrote:
 Is there currently a way to view and manipulate the chunks?  If I
 understand things correctly, a new fs has a few chunks:
 
 1)  System chunk.  Contains tree of trees, device tree, chunk tree.
 2)  Metadata chunk.  Contains the directory tree for the default subvol,
 and any additional subvols/snapshots you create.  Directory entries and
 inodes are in this tree.
 3)  Data chunk.  Files with significant data have blocks allocated from
 this chunk.
 
 The system chunk is always mirrored, even on a single disk.  The
 metadata chunk is mirrored by default, but can be changed with a
 parameter to mkfs.  The data chunk is striped by default, but can be
 changed via parameter to mkfs.  The chunks are all expanded as needed.
  Is this correct, and is there a way to create a subvol with a new pair
  of metadata/data chunks, specifying how they should be striped or
  mirrored and across which devices?



Re: find subvolume (used) size?

2010-10-05 Thread Phillip Susi
On 10/5/2010 7:26 AM, Tomasz Chmielewski wrote:
 There is a standard df tool, but it can be a lengthy process for
 filesystems with lots of files.

Maybe you mean du?  df takes almost no time at all and does not care how
many files there are.