Re: Is it safe to use btrfs on top of different types of devices?

2017-10-20 Thread Austin S. Hemmelgarn

On 2017-10-19 14:39, Peter Grandi wrote:

[ ... ]

Oh please, please a bit less silliness would be welcome here.
In a previous comment on this tedious thread I had written:



If the block device abstraction layer and lower layers work
correctly, Btrfs does not have problems of that sort when
adding new devices; conversely if the block device layer and
lower layers do not work correctly, no mainline Linux
filesystem I know can cope with that.



Note: "work correctly" does not mean "work error-free".



The last line is very important and I added it advisedly.

[ ... ]

Filesystems run on top of *block-devices* with a definite
interface and a definite state machine, and filesystems in
general assume that the block-device works *correctly*.



They do run on top of USB or SATA devices, otherwise a
significant majority of systems running Linux and/or BSD
should not be operating right now.


That would be big news to any Linux/UNIX filesystem developer,
who would have to rush to add SATA and USB protocol and state
machine handling to their implementations, which currently only
support the block-device protocol and state machine.
Please send patches :-)
In casual conversation, the concept of something being on top of 
something else is usually a transitive property: higher layers are still 
built on top of lower layers, irrespective of what layers are in between 
them (though this does not preclude lower layers from being 
interchangeable).  Filesystems run on top of SATA or USB devices because 
they run on top of the block layer, which in turn runs on top of SATA 
and USB devices.  This is no different than how SSH runs on top of IP 
because it runs on TCP or SCTP, which in turn run on top of IP.


   Note to some readers: there are filesystems designed to work
   not on top of block devices but, for example, on top of the MTD
   abstraction layer.


Yes, they don't directly access them, but the block layer
isn't much more than command translation, scheduling, and
accounting, so this distinction is meaningless and largely
irrelevant.


More tedious silliness and grossly ignorant too, because the
protocol and state machine of the block-device layer is
completely different from that of both SATA and USB, and the
mapping of the SATA or USB protocols and state machines onto the
block-device ones is actually a very complex, difficult, and
error prone task, involving mountains of very hairy code. In
particular since the block-device protocol and state machine are
rather simplistic, a lot is lost in translation.

   Note: the SATA handling firmware in disk devices often involves
   *tens of thousands* of lines of code, and "all it does" is
   "just" reading the device and passing the content over the IO
   bus.

Filesystems are designed to that very simplistic protocol and
state machine for good reasons, and sometimes they are designed
to just a subset of it; for example most filesystem designs
assume that block-device writes never fail (that is, bad sector
sparing is done by a lower layer), and only some handle
block-device read failures gracefully.
Yes, and the block layer is still largely a protocol translator and 
scheduler.  There may be a great deal of _internal_ complexity 
(especially with SATA and USB, since they both get routed through the 
SCSI layer below the block layer), but that's not really all that 
relevant to the discussion since we can reasonably assume that it is 
functionally perfectly reliable (and it largely is).


Once it's statistically impossible for a given layer to 
be the source of an error, and that layer passes errors from lower layers 
up to higher ones (possibly with some translation) without handling them 
itself, that layer becomes irrelevant for practical discussions of 
error handling, until you can prove that it was the source of the error 
being discussed.



[ ... ] to refer to a block-device connected via interface 'X'
as an 'X device' or an 'X storage device'.


More tedious silliness as this is a grossly misleading shorthand
when the point of the discussion is the error recovery protocol
and state machine assumed by filesystem designers. To me it seems
that if people use that shorthand in that context, as if it was
not a shorthand, they don't know what they are talking about, or
they are trying to mislead the discussion.
You mean like your whole tirade about my terminology being fundamentally 
wrong?  Everyone else in the thread appears to have understood what I 
meant perfectly, except you.

[ ... ] For an end user, it generally doesn't matter whether a
given layer reported the error or passed it on (or generated
it), it matters whether it was corrected or not. [ ... ]


You seem unable or unwilling to appreciate how detected and
undetected errors are fundamentally different, and how layering
of greatly different protocols is a complicated issue highly
relevant to error recovery, so you seem to assume that other end
users are likewise unable or unwilling.

Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Peter Grandi
[ ... ]

>> are USB drives really that unreliable [ ... ]
[ ... ]
> There are similar SATA chips too (occasionally JMicron and
> Marvell for example are somewhat less awesome than they could
> be), and practically all Firewire bridge chips of old "lied" a
> lot [ ... ]
> That plus Btrfs is designed to work on top of a "well defined"
> block device abstraction that is assumed to "work correctly"
> (except for data corruption), [ ... ]

When I insist on the reminder that Btrfs is designed to use the
block-device protocol and state machine, rather than USB and
SATA devices, it is because that makes more explicit that the
various layers between the filesystem and the USB or SATA device can "lie" too,
including for example the Linux page cache which is just below
the block-device layer. But also the disk scheduler, the SCSI
protocol handler, the USB and SATA drivers and disk drivers, the
PCIe chipset, the USB or SATA host bus adapter, the cable, the
backplane.

This paper reports the results of some testing of "enterprise
grade" storage systems at CERN, and some of the symptoms imply
that "lies" can happen *anywhere*. It is scary. It supports
having data checksumming in the filesystem, a rather extreme
choice.


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Peter Grandi
[ ... ]
>>> Oh please, please a bit less silliness would be welcome here.
>>> In a previous comment on this tedious thread I had written:

> If the block device abstraction layer and lower layers work
> correctly, Btrfs does not have problems of that sort when
> adding new devices; conversely if the block device layer and
> lower layers do not work correctly, no mainline Linux
> filesystem I know can cope with that.

> Note: "work correctly" does not mean "work error-free".

>>> The last line is very important and I added it advisedly.
[ ... ]
>> Filesystems run on top of *block-devices* with a definite
>> interface and a definite state machine, and filesystems in
>> general assume that the block-device works *correctly*.

> They do run on top of USB or SATA devices, otherwise a
> significant majority of systems running Linux and/or BSD
> should not be operating right now.

That would be big news to any Linux/UNIX filesystem developer,
who would have to rush to add SATA and USB protocol and state
machine handling to their implementations, which currently only
support the block-device protocol and state machine.
Please send patches :-)

  Note to some readers: there are filesystems designed to work
  not on top of block devices but, for example, on top of the MTD
  abstraction layer.

> Yes, they don't directly access them, but the block layer
> isn't much more than command translation, scheduling, and
> accounting, so this distinction is meaningless and largely
> irrelevant.

More tedious silliness and grossly ignorant too, because the
protocol and state machine of the block-device layer is
completely different from that of both SATA and USB, and the
mapping of the SATA or USB protocols and state machines onto the
block-device ones is actually a very complex, difficult, and
error prone task, involving mountains of very hairy code. In
particular since the block-device protocol and state machine are
rather simplistic, a lot is lost in translation.

  Note: the SATA handling firmware in disk devices often involves
  *tens of thousands* of lines of code, and "all it does" is
  "just" reading the device and passing the content over the IO
  bus.

Filesystems are designed to that very simplistic protocol and
state machine for good reasons, and sometimes they are designed
to just a subset of it; for example most filesystem designs
assume that block-device writes never fail (that is, bad sector
sparing is done by a lower layer), and only some handle
block-device read failures gracefully.

> [ ... ] to refer to a block-device connected via interface 'X'
> as an 'X device' or an 'X storage device'.

More tedious silliness as this is a grossly misleading shorthand
when the point of the discussion is the error recovery protocol
and state machine assumed by filesystem designers. To me it seems
that if people use that shorthand in that context, as if it was
not a shorthand, they don't know what they are talking about, or
they are trying to mislead the discussion.

> [ ... ] For an end user, it generally doesn't matter whether a
> given layer reported the error or passed it on (or generated
> it), it matters whether it was corrected or not. [ ... ]

You seem unable or unwilling to appreciate how detected and
undetected errors are fundamentally different, and how layering
of greatly different protocols is a complicated issue highly
relevant to error recovery, so you seem to assume that other end
users are likewise unable or unwilling.

But I am not so dismissive of "end users", and I assume that
there are end users that can eventually understand that Btrfs in
the main is not designed to handle devices that "lie" because
Btrfs actually is designed to use the block-device layer which
is assumed to "work correctly" (except for checksums).


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Peter Grandi
> [ ... ] when writes to a USB device fail due to a temporary
> disconnection, the kernel can actually recognize that a write
> error happened. [ ... ]

Usually, but who knows? Maybe half the transfer gets written; maybe
the data gets written to the wrong address; maybe stuff gets
written but a failure is reported; and this happens not just when the
connection dies, but also when it does not.

> are USB drives really that unreliable [ ... ]

Welcome to the "real world", also called "Shenzen" :-).

There aren't that many "USB drives", as I wrote somewhere there
are usually USB host bus adapters (on the system side) and USB
IO bus (usually SATA) bridges (on the device side).

They both have to do difficult feats of conversion and signaling,
and in the USB case they are usually designed by a stressed,
overworked engineer in Guangzhou or Taiwan employed by a no-name
contractor who submitted the lowest bid to a no-name
manufacturer, and was told to do the cheapest design to fabricate
in the shortest possible time. Most of the time they mostly work,
good enough for keyboards and mice, and for photos of cats on USB
sticks; most users just unplug and replug them if they flake
out. BTW my own USB keyboard and mouse and their USB host bus
adapter occasionally crash too, and my webcam flakes out more
often than not. USB is a mixed bag of poorly designed and complex
protocols, and it is very easy to do a bad implementation.

There are similar SATA chips too (occasionally JMicron and
Marvell for example are somewhat less awesome than they could
be), and practically all Firewire bridge chips of old "lied" a
lot except a few Oxford Semi ones (the legendary 911 series).
I have even seen lying SAS "enterprise" grade storage
interconnects. I had indeed previously written:

  > If you have concerns about the reliability of specific
  > storage and system configurations you should become or find a
  > system integration and qualification engineer who understands
  > the many subtleties of storage devices and device-system
  > interconnects and who would run extensive tests on it;
  > storage and system commissioning is often far from trivial
  > even in seemingly simple cases, due in part to the enormous
  > complexity of interfaces, even when they have few bugs, and
  > tests made with one combination often do not have the same
  > results even on apparently similar combinations.

On the #Btrfs IRC channel there is a small group of cynical
helpers, and when someone mentions "strange things happening" one
of them usually immediately asks "USB?" and in most cases the
answer is "how did you know?".

That, plus Btrfs is designed to work on top of a "well defined"
block device abstraction that is assumed to "work correctly"
(except for data corruption), and the Linux block device
abstraction and the SATA and USB layers beneath it are not designed
to handle devices that "lie" (well, there are blacklists with
workarounds for known systematic bugs, but that is only partial).


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Peter Grandi
> [ ... ] However, the disappearance of the device doesn't get
> propagated up to the filesystem correctly,

Indeed, sometimes it does, sometimes it does not, in part
because of chipset bugs, in part because the USB protocol
signaling side does not handle errors well even if the chipset
were bug free.

> and that is what causes the biggest issue with BTRFS. Because
> BTRFS just knows writes are suddenly failing for some reason,
> it doesn't try to release the device so that things get
> properly cleaned up in the kernel, and thus when the same
> device reappears (as it will when the disconnect was due to a
> transient bus error, which happens a lot), it shows up as a
> different device node, which gets scanned for filesystems by
> udev, and BTRFS then gets really confused because it now sees
> 3 (or more) devices for a 2 device filesystem.

That's a good description that should be on the wiki.


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Austin S. Hemmelgarn

On 2017-10-19 10:42, Zoltan wrote:

On Thu, Oct 19, 2017 at 4:27 PM, Austin S. Hemmelgarn
 wrote:


and thus when the same device reappears (as it will when the disconnect was
due to a transient bus error, which happens a lot), it shows up as a
different device node, which gets scanned for filesystems by udev, and BTRFS
then gets really confused because it now sees 3 (or more) devices for a 2
device filesystem.


And what would happen with a regular, single-device BTRFS volume after
a reconnect? Isn't this issue just as bad for that case?
No, because the multi-device code only gets used if the filesystem 
claims to have more than one device, and it's a bug in the multi-device 
code that causes this problem.  From a data safety perspective, the 
disconnect will look like a power loss event if it was a single device 
filesystem, and BTRFS handles that situation fine (though you would 
probably need to remount the filesystem).


FWIW, the same bug causes similar data loss problems with block-level 
copies of BTRFS filesystems (if you then mount either the original or 
the copy while both are visible to the system), and allows you to screw 
up multi-device filesystems by connecting a storage device with a 
carefully crafted bogus BTRFS filesystem on it.  Overall though, it's 
not been seen as a high priority bug because:


1. Nobody has come up with a reliable method of handling it that doesn't 
break anything or require revising the on-disk layout.
2. It's easy to work around (don't do block level copies and ensure 
proper physical security of the system like you should be doing anyway).



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Zoltan
On Thu, Oct 19, 2017 at 4:27 PM, Austin S. Hemmelgarn
 wrote:

> and thus when the same device reappears (as it will when the disconnect was
> due to a transient bus error, which happens a lot), it shows up as a
> different device node, which gets scanned for filesystems by udev, and BTRFS
> then gets really confused because it now sees 3 (or more) devices for a 2
> device filesystem.

And what would happen with a regular, single-device BTRFS volume after
a reconnect? Isn't this issue just as bad for that case?

Thanks,

Zoltan


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Austin S. Hemmelgarn

On 2017-10-19 09:48, Zoltan wrote:

Hi,

On Thu, Oct 19, 2017 at 1:01 PM, Peter Grandi  
wrote:


What the OP was doing was using "unreliable" both for the case
where the device "lies" and the case where the device does not
"lie" but reports a failure. Both of these are malfunctions in a
wide sense:

   * The [block] device "lies" as to its status or what it has done.
   * The [block] device reports truthfully that an action has failed.


Thanks for making this point; it made me realize that I had a different
assumption than what you use in your reasoning. I assumed that when
writes to a USB device fail due to a temporary disconnection, the
kernel can actually recognize that a write error happened. So are you
saying that a write error due to USB problems can go completely
unnoticed? That seems very strange to me; are USB drives really that
unreliable or is that some software limitation?

It depends on what type of write error happens.

If it's a case where the data gets corrupted on its way over the bus, 
or the device just drops the write, or you have a bogus storage device 
(this is actually a pretty big issue with flash drives and SD cards, 
check [1] and [2] for more info on this, and [3] for a tool you can use 
to check things), then it generally won't be detected by the kernel, but 
might be by the filesystem driver when it tries to read data.


However, it doesn't go completely undetected if the device disconnects 
(which is where the big issue with BTRFS comes in): the kernel will 
detect the disconnect, issue a bus reset (which will cause performance 
issues with other USB devices on the same controller), and generally 
recover.  However, the disappearance of the device doesn't get 
propagated up to the filesystem correctly, and that is what causes the 
biggest issue with BTRFS.  Because BTRFS just knows writes are suddenly 
failing for some reason, it doesn't try to release the device so that 
things get properly cleaned up in the kernel, and thus when the same 
device reappears (as it will when the disconnect was due to a transient 
bus error, which happens a lot), it shows up as a different device node, 
which gets scanned for filesystems by udev, and BTRFS then gets really 
confused because it now sees 3 (or more) devices for a 2 device 
filesystem.  That final resultant state is what's so dangerous about 
using USB devices with BTRFS right now, as it's pretty much guaranteed 
to result in data corruption.
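
To make that failure mode concrete, here is a toy sketch in Python (purely 
illustrative; it is not the kernel's device-scanning code, just a model of 
the state described above, with made-up device names and fsid):

class FsRegistry:
    """Toy model of per-filesystem device tracking, keyed by fsid and devid."""
    def __init__(self):
        self.entries = {}   # fsid -> list of (devid, node) the filesystem has seen

    def scan(self, node, fsid, devid):
        devs = self.entries.setdefault(fsid, [])
        devs.append((devid, node))   # stale entries are never released or replaced
        return devs

reg = FsRegistry()
reg.scan("/dev/sdb1", fsid="1234-abcd", devid=1)   # first device of a 2-device fs
reg.scan("/dev/sdc1", fsid="1234-abcd", devid=2)   # second device

# /dev/sdc1 drops off the bus; writes to it keep failing, but it is never
# released.  After the bus reset the same disk reappears as a new node,
# udev scans it, and it gets registered again:
devs = reg.scan("/dev/sdd1", fsid="1234-abcd", devid=2)

print(len(devs), "device entries for a 2-device filesystem")   # prints: 3 ...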



[1] https://fightflashfraud.wordpress.com/
[2] https://sosfakeflash.wordpress.com/
[3] http://oss.digirati.com.br/f3/


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Zoltan
Hi,

On Thu, Oct 19, 2017 at 1:01 PM, Peter Grandi  
wrote:

> What the OP was doing was using "unreliable" both for the case
> where the device "lies" and the case where the device does not
> "lie" but reports a failure. Both of these are malfunctions in a
> wide sense:
>
>   * The [block] device "lies" as to its status or what it has done.
>   * The [block] device reports truthfully that an action has failed.

Thanks for making this point; it made me realize that I had a different
assumption than what you use in your reasoning. I assumed that when
writes to a USB device fail due to a temporary disconnection, the
kernel can actually recognize that a write error happened. So are you
saying that a write error due to USB problems can go completely
unnoticed? That seems very strange to me; are USB drives really that
unreliable or is that some software limitation?

Thanks,

Zoltan


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Austin S. Hemmelgarn

On 2017-10-19 07:01, Peter Grandi wrote:

[ ... ]


Oh please, please a bit less silliness would be welcome here.
In a previous comment on this tedious thread I had written:



If the block device abstraction layer and lower layers work
correctly, Btrfs does not have problems of that sort when
adding new devices; conversely if the block device layer and
lower layers do not work correctly, no mainline Linux
filesystem I know can cope with that.



Note: "work correctly" does not mean "work error-free".


The last line is very important and I added it advisedly.



Even looking at things that way though, Zoltan's assessment
that reliability is essentially a measure of error rate is
correct.


It is instead based on a grave confusion between two very
different kinds of "error rate", confusion also partially based
on the ridiculous misunderstanding, which I have already pointed
out, that UNIX filesystems run on top of SATA or USB devices:


Internal SATA devices absolutely can randomly drop off the bus
just like many USB storage devices do,


Filesystems run on top of *block devices* with a definite
interface and a definite state machine, and filesystems in
general assume that the block device works *correctly*.
They do run on top of USB or SATA devices, otherwise a significant 
majority of systems running Linux and/or BSD should not be operating 
right now.  Yes, they don't directly access them, but the block layer 
isn't much more than command translation, scheduling, and accounting, so 
this distinction is meaningless and largely irrelevant.  It's also 
pretty standard practice among most sane sysadmins who aren't trying to 
be jerks, as well as among most kernel developers I've met, to refer to a 
block device connected via interface 'X' as an 'X device' or an 'X 
storage device'.



but it almost never happens (it's a statistical impossibility
if there are no hardware or firmware issues), so they are more
reliable in that respect.


What the OP was doing was using "unreliable" both for the case
where the device "lies" and the case where the device does not
"lie" but reports a failure. Both of these are malfunctions in a
wide sense:

   * The [block] device "lies" as to its status or what it has done.
   * The [block] device reports truthfully that an action has failed.

But they are of very different nature and need completely
different handling. Hint: one is an extensional property and the
other is a modal one, there is a huge difference between "this
data is wrong" and "I know that this data is wrong".

The really important "detail" is that filesystems are, as a rule
with very few exceptions, designed to work only if the block
device layer (and those below it) does not "lie" (see "Byzantine
failures" below), that is "works correctly": reports the failure
of every operation that fails and the success of every operation
that succeeds and never gets into an unexpected state.

In particular filesystems designs are nearly always based on the
assumption that there are no undetected errors at the block
device level or below. Then the expected *frequency* of detected
errors influences how much redundancy and what kind of recovery
are desirable, but the frequency of "lies" is assumed to be
zero.

The one case where Btrfs does not assume that the storage layer
works *correctly* is checksumming: it is quite expensive and
makes sense only if the block device is expected to (sometimes)
"lie" about having written the data correctly or having read it
correctly. The role of the checksum is to spot when a block
device "lies" and turn an undetected read error into a detected
one (they could be used also to detect correct writes that are
misreported as having failed).

The crucial difference that exists between SATA and USB is not
that USB chips have higher rates of detected failures (even if
they often do), but that in my experience SATA interfaces from
reputable suppliers don't "lie" (more realistically have
negligible "lie" rates), and USB interfaces (both host bus
adapters and IO bus bridges) "lie" both systematically and
statistically with non negligible rates, and anyhow the USB mass
storage protocol is not very good at error reporting and
handling.
You do realize you just said exactly what I was saying, just in a more 
general and much more verbose manner which involved explaining things 
that are either well known and documented or aren't even entirely 
relevant to the thread in question?


For an end user, it generally doesn't matter whether a given layer 
reported the error or passed it on (or generated it), it matters whether 
it was corrected or not.  If the subset of the storage stack below 
whatever layer is being discussed (in this case the filesystem) causes 
errors that it does not correct at a rate deemed unacceptable for the 
given application, it's unreliable, regardless of whether or not they get 
corrected at this layer or a higher one.  Even if you're running BTRFS 
on top of it, a SATA connected 

Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Peter Grandi
[ ... ]

>> Oh please, please a bit less silliness would be welcome here.
>> In a previous comment on this tedious thread I had written:

>> > If the block device abstraction layer and lower layers work
>> > correctly, Btrfs does not have problems of that sort when
>> > adding new devices; conversely if the block device layer and
>> > lower layers do not work correctly, no mainline Linux
>> > filesystem I know can cope with that.
>> 
>> > Note: "work correctly" does not mean "work error-free".
>> 
>> The last line is very important and I added it advisedly.

> Even looking at things that way though, Zoltan's assessment
> that reliability is essentially a measure of error rate is
> correct.

It is instead based on a grave confusion between two very
different kinds of "error rate", confusion also partially based
on the ridiculous misunderstanding, which I have already pointed
out, that UNIX filesystems run on top of SATA or USB devices:

> Internal SATA devices absolutely can randomly drop off the bus
> just like many USB storage devices do,

Filesystems run on top of *block devices* with a definite
interface and a definite state machine, and filesystems in
general assume that the block device works *correctly*.

> but it almost never happens (it's a statistical impossibility
> if there are no hardware or firmware issues), so they are more
> reliable in that respect.

What the OP was doing was using "unreliable" both for the case
where the device "lies" and the case where the device does not
"lie" but reports a failure. Both of these are malfunctions in a
wide sense:

  * The [block] device "lies" as to its status or what it has done.
  * The [block] device reports truthfully that an action has failed.

But they are of very different nature and need completely
different handling. Hint: one is an extensional property and the
other is a modal one, there is a huge difference between "this
data is wrong" and "I know that this data is wrong".

The really important "detail" is that filesystems are, as a rule
with very few exceptions, designed to work only if the block
device layer (and those below it) does not "lie" (see "Byzantine
failures" below), that is "works correctly": reports the failure
of every operation that fails and the success of every operation
that succeeds and never gets into an unexpected state.

In particular filesystems designs are nearly always based on the
assumption that there are no undetected errors at the block
device level or below. Then the expected *frequency* of detected
errors influences how much redundancy and what kind of recovery
are desirable, but the frequency of "lies" is assumed to be
zero.

The one case where Btrfs does not assume that the storage layer
works *correctly* is checksumming: it is quite expensive and
makes sense only if the block device is expected to (sometimes)
"lie" about having written the data correctly or having read it
correctly. The role of the checksum is to spot when a block
device "lies" and turn an undetected read error into a detected
one (they could be used also to detect correct writes that are
misreported as having failed).
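
To illustrate just that role (a toy sketch, not Btrfs's actual checksum
tree or its crc32c-over-4KiB scheme): keep a checksum next to each block
on write and verify it on read, and a device that silently returns wrong
data with a success status produces a detected error instead of an
undetected one.

import zlib

blocks, sums = {}, {}   # toy block store plus out-of-band checksums

def write_block(n, data):
    blocks[n] = data
    sums[n] = zlib.crc32(data)

def read_block(n):
    data = blocks[n]
    if zlib.crc32(data) != sums[n]:
        raise IOError("checksum mismatch on block %d: the device 'lied'" % n)
    return data

write_block(0, b"important metadata")
blocks[0] = b"importanT metadata"   # silent corruption: the read still "succeeds"
try:
    read_block(0)
except IOError as e:
    print(e)   # the 'lie' has been turned into a detected read error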

The crucial difference that exists between SATA and USB is not
that USB chips have higher rates of detected failures (even if
they often do), but that in my experience SATA interfaces from
reputable suppliers don't "lie" (more realistically have
negligible "lie" rates), and USB interfaces (both host bus
adapters and IO bus bridges) "lie" both systematically and
statistically with non negligible rates, and anyhow the USB mass
storage protocol is not very good at error reporting and
handling.

>> The "working incorrectly" general case is the so called
>> "bizantine generals problem" [ ... ]

This is compsci for beginners, and someone dealing with storage
issues (and not just those) should be intimately familiar with the
implications:

  https://en.wikipedia.org/wiki/Byzantine_fault_tolerance

  Byzantine failures are considered the most general and most
  difficult class of failures among the failure modes. The
  so-called fail-stop failure mode occupies the simplest end of
  the spectrum. Whereas fail-stop failure model simply means
  that the only way to fail is a node crash, detected by other
  nodes, Byzantine failures imply no restrictions, which means
  that the failed node can generate arbitrary data, pretending
  to be a correct one, which makes fault tolerance difficult.
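
In code terms the distinction looks roughly like this (a loose sketch
only): a fail-stop layer at worst raises an error the caller can see and
retry, while a Byzantine layer can return wrong data with a success
status, which the caller cannot tell apart from correct data without
outside information such as a checksum.

def fail_stop_read(data, broken):
    if broken:
        raise IOError("read failed")   # worst case: the failure is *reported*
    return data

def byzantine_read(data, broken):
    if broken:
        return b"arbitrary garbage"    # worst case: wrong data, success status
    return data

try:
    fail_stop_read(b"payload", broken=True)
except IOError:
    print("fail-stop: caller saw the failure and can retry or give up")

print("byzantine:", byzantine_read(b"payload", broken=True))   # looks like success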


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-18 Thread Austin S. Hemmelgarn

On 2017-10-18 07:59, Adam Borowski wrote:

On Wed, Oct 18, 2017 at 07:30:55AM -0400, Austin S. Hemmelgarn wrote:

On 2017-10-17 16:21, Adam Borowski wrote:

It's a single-device filesystem, thus disconnects are obviously fatal.  But,
they never caused even a single bit of damage (as scrub goes), thus proving
btrfs handles this kind of disconnects well.  Unlike times past, the kernel
doesn't get confused thus no reboot is needed, merely an unmount, "service
nbd-client restart", mount, restart the rebuild jobs.

That's expected behavior though.  _Single_ device BTRFS has nothing to get
out of sync most of the time, the only time there's any possibility of an
issue is when you die after writing the first copy of a block that's in a
dup profile chunk, but even that is not very likely to cause problems
(you'll just lose at most the last  worth of data).


How come?  In a DUP profile, the writes are: chunk 1, chunk2, barrier,
superblock.  The two prior writes may be arbitrarily reordered -- both
between each other or even individual sectors inside the chunks, but unless
the disk lies about barriers, there's no way to have any corruption, thus
running scrub is not needed.

If the device dies after writing chunk 1 but before the barrier, you end up
needing scrub.  How much of a failure window is present is largely a
function of how fast the device is, but there is a failure window there.


CoW is there to ensure there is _no_ failure window.  The new content
doesn't matter until there are live pointers to it -- from the filesystem's
point of view we merely scribbled something on an unused part of the block
device.  Only after all pieces are in place (as ensured by the barrier), the
superblock is updated with a reference to the new metadata->data chain.
Even with CoW there _IS_ a failure window.  At a bare minimum, when 
updating the root of the tree which has multiple copies, you have a 
failure window.  This window could admittedly be significantly reduced 
for multi-device setups if we actually parallelized writes properly, but 
it would still be there.
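
A minimal sketch of that window (illustrative only; it just assumes the
copies of the root pointer are written one after another rather than
atomically):

copies = [41, 41]   # generation recorded in each on-disk copy of the root pointer

def update_root(new_gen, crash_between_writes=False):
    copies[0] = new_gen
    if crash_between_writes:
        raise RuntimeError("disconnect between the two writes")
    copies[1] = new_gen

try:
    update_root(42, crash_between_writes=True)
except RuntimeError:
    pass

print(copies)   # [42, 41] -- the copies now disagree and need reconciling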


Thus, no matter when a disconnect happens, after a crash you get either
uncorrupted old version or uncorrupted new version.

No scrub is ever needed for this reason on single device or on RAID1 that
didn't run degraded.
The whole conversation started regarding a RAID1 array that's 
functionally guaranteed to run degraded on a regular basis.



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-18 Thread Austin S. Hemmelgarn

On 2017-10-18 09:53, Peter Grandi wrote:

I forget sometimes that people insist on storing large
volumes of data on unreliable storage...


Here obviously "unreliable" is used on the sense of storage that
can work incorrectly, not in the sense of storage that can fail.
Um, in what world is a device randomly dropping off the bus (this is the 
primary issue with USB for storage) not a failure?  Yes, it's not a 
catastrophic failure (for BTRFS at least), and it's transient (the 
kernel will re-enumerate the device when it resets the bus), but that 
doesn't change the fact that the service that is supposed to be provided 
by the device failed.


To clarify more concretely, when I say 'unreliable' in reference to 
computer technology (and for that matter, almost anything else), I mean 
something that encounters non-trivial error states, either correctable 
or uncorrectable, at a frequency above that which is deemed reasonable 
for the designed function of the device.



In my opinion the unreliability of the storage is the exact
reason for wanting to use raid1. And I think any problem one
encounters with an unreliable disk can likely happen with more
reliable ones as well, only less frequently, so if I don't
feel comfortable using raid1 on an unreliable medium then I
wouldn't trust it on a more reliable one either.


Oh please, please a bit less silliness would be welcome here.
In a previous comment on this tedious thread I had written:

   > If the block device abstraction layer and lower layers work
   > correctly, Btrfs does not have problems of that sort when
   > adding new devices; conversely if the block device layer and
   > lower layers do not work correctly, no mainline Linux
   > filesystem I know can cope with that.

   > Note: "work correctly" does not mean "work error-free".

The last line is very important and I added it advisedly.

You seem to be using "unreliable" in two completely different
meanings, without realizing it, as both "working incorrectly"
and "reporting a failure". They are really very different.
And you seem to be using the term 'failure' to only mean 'catastrophic 
failure'.  Strictly speaking, even that is 'working incorrectly', albeit 
in a much more specific and permanent manner than just returning errors.


Even looking at things that way though, Zoltan's assessment that 
reliability is essentially a measure of error rate is correct.  Internal 
SATA devices absolutely can randomly drop off the bus just like many USB 
storage devices do, but it almost never happens (it's a statistical 
impossibility if there are no hardware or firmware issues), so they are 
more reliable in that respect.


The "working incorrectly" general case is the so called
"bizantine generals problem" and (depending on assumptions) it
is insoluble.

Btrfs has some limited ability to detect (and sometimes recover
from) "working incorrectly" storage layers, but don't expect too
much from that.



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-18 Thread Peter Grandi
> [ ... ] After all, btrfs would just have to discard one copy
> of each chunk. [ ... ]  One more thing that is not clear to me
> is the replication profile of a volume. I see that balance can
> convert chunks between profiles, for example from single to
> raid1, but I don't see how the default profile for new chunks
> can be set or queried. [ ... ]

My impression is that the design rationale and aims for Btrfs
two-level allocation (in other fields known as a "BIBOP" scheme)
were not fully shared among Btrfs developers, that perhaps it
could have benefited from some further reflection on its
implications, and that its behaviour may have evolved
"opportunistically", maybe without much worrying as to
conceptual integrity. (I am trying to be euphemistic)

So while I am happy with the "Rodeh" core of Btrfs (COW,
subvolumes, checksums), the RAID-profile functionality and
especially the multi-device layer is not something I find
particularly to my taste. (I am trying to be euphemistic)

So when it comes to allocation, RAID-profiles, multiple devices,
I usually expect some random "surprising functionality". (I am
trying to be euphemistic)


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-18 Thread Peter Grandi
>> I forget sometimes that people insist on storing large
>> volumes of data on unreliable storage...

Here obviously "unreliable" is used on the sense of storage that
can work incorrectly, not in the sense of storage that can fail.

> In my opinion the unreliability of the storage is the exact
> reason for wanting to use raid1. And I think any problem one
> encounters with an unreliable disk can likely happen with more
> reliable ones as well, only less frequently, so if I don't
> feel comfortable using raid1 on an unreliable medium then I
> wouldn't trust it on a more reliable one either.

Oh please, please a bit less silliness would be welcome here.
In a previous comment on this tedious thread I had written:

  > If the block device abstraction layer and lower layers work
  > correctly, Btrfs does not have problems of that sort when
  > adding new devices; conversely if the block device layer and
  > lower layers do not work correctly, no mainline Linux
  > filesystem I know can cope with that.

  > Note: "work correctly" does not mean "work error-free".

The last line is very important and I added it advisedly.

You seem to be using "unreliable" in two completely different
meanings, without realizing it, as both "working incorrectly"
and "reporting a failure". They are really very different.

The "working incorrectly" general case is the so called
"bizantine generals problem" and (depending on assumptions) it
is insoluble.

Btrfs has some limited ability to detect (and sometimes recover
from) "working incorrectly" storage layers, but don't expect too
much from that.


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-18 Thread Adam Borowski
On Wed, Oct 18, 2017 at 07:30:55AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-17 16:21, Adam Borowski wrote:
> > > > It's a single-device filesystem, thus disconnects are obviously fatal.  
> > > > But,
> > > > they never caused even a single bit of damage (as scrub goes), thus 
> > > > proving
> > > > btrfs handles this kind of disconnects well.  Unlike times past, the 
> > > > kernel
> > > > doesn't get confused thus no reboot is needed, merely an unmount, 
> > > > "service
> > > > nbd-client restart", mount, restart the rebuild jobs.
> > > That's expected behavior though.  _Single_ device BTRFS has nothing to get
> > > out of sync most of the time, the only time there's any possibility of an
> > > issue is when you die after writing the first copy of a block that's in a
> > > dup profile chunk, but even that is not very likely to cause problems
> > > (you'll just lose at most the last  worth of data).
> > 
> > How come?  In a DUP profile, the writes are: chunk 1, chunk2, barrier,
> > superblock.  The two prior writes may be arbitrarily reordered -- both
> > between each other or even individual sectors inside the chunks, but unless
> > the disk lies about barriers, there's no way to have any corruption, thus
> > running scrub is not needed.
> If the device dies after writing chunk 1 but before the barrier, you end up
> needing scrub.  How much of a failure window is present is largely a
> function of how fast the device is, but there is a failure window there.

CoW is there to ensure there is _no_ failure window.  The new content
doesn't matter until there are live pointers to it -- from the filesystem's
point of view we merely scribbled something on an unused part of the block
device.  Only after all pieces are in place (as ensured by the barrier), the
superblock is updated with a reference to the new metadata->data chain.

Thus, no matter when a disconnect happens, after a crash you get either
uncorrupted old version or uncorrupted new version.

No scrub is ever needed for this reason on single device or on RAID1 that
didn't run degraded.
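
A toy sketch of that ordering (not real Btrfs code; the "disk" is a dict
and the barrier is just the point before which nothing later is allowed
to be reordered):

disk = {"super": "tree-A", "tree-A": "old data"}   # super points at the live tree

def commit(new_data, crash_before_super=False):
    disk["tree-B/copy1"] = new_data   # new blocks: unreachable scribbling so far
    disk["tree-B/copy2"] = new_data   # (these two writes may be freely reordered)
    # --- barrier: both copies are durable before the superblock is touched ---
    if crash_before_super:
        return                        # crash here: super still points at tree-A
    disk["super"] = "tree-B/copy1"    # one pointer flip publishes the new tree

commit("new data", crash_before_super=True)
print(disk["super"], "->", disk[disk["super"]])   # tree-A -> old data, intact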


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄ sky.  Your cat demands food.  The priority should be obvious...


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-18 Thread Austin S. Hemmelgarn

On 2017-10-17 16:21, Adam Borowski wrote:

On Tue, Oct 17, 2017 at 03:19:09PM -0400, Austin S. Hemmelgarn wrote:

On 2017-10-17 13:06, Adam Borowski wrote:

The thing is, reliability guarantees required vary WILDLY depending on your
particular use cases.  On one hand, there's "even an one-minute downtime
would cost us mucho $$$s, can't have that!" -- on the other, "it died?
Okay, we got backups, lemme restore it after the weekend".

Yes, but if you are in the second case, you arguably don't need replication,
and would be better served by improving the reliability of your underlying
storage stack than trying to work around its problems. Even in that case,
your overall reliability is still constrained by the least reliable
component (in more idiomatic terms 'a chain is only as strong as its
weakest link').


MD can handle this case well, there's no reason btrfs shouldn't do that too.
A RAID is not akin to serially connected chain, it's a parallel connected
chain: while pieces of the broken second chain hanging down from the first
don't make it strictly more resilient than having just a single chain, in
general case it _is_ more reliable even if the other chain is weaker.
My chain analogy is supposed to be relating to the storage stack as a 
whole, RAID is a single link in the chain, with whatever filesystem 
above it, and whatever storage drivers and hardware below.


Don't we have a patchset that deals with marking a device as failed at
runtime floating on the mailing list?  I did not look at those patches yet,
but they are a step in this direction.
There were some disagreements on whether the device should be released 
(that is, the node closed) immediately when we know it's failed, or 
should be held open until remount.



Using replication with a reliable device and a questionable device is
essentially the same as trying to add redundancy to a machine by adding an
extra linkage that doesn't always work and can get in the way of the main
linkage it's supposed to be protecting from failure.  Yes, it will work most
of the time, but the system is going to be less reliable than it is without
the 'redundancy'.


That's the current state of btrfs, but the design is sound, and reaching
more than parity with MD is a matter of implementation.
Indeed, however MD is still not perfectly reliable in this situation 
(though they are exponentially better than BTRFS at the moment).



Thus, I switched the machine to NBD (albeit it sucks on 100Mbit eth).  Alas,
the network driver allocates memory with GFP_NOIO which causes NBD
disconnects (somehow, this doesn't ever happen on swap where GFP_NOIO would
be obvious but on regular filesystem where throwing out userspace memory is
safe).  The disconnects happen around once per week.

Somewhat off-topic, but you might try looking at ATAoE as an alternative,
it's more reliable in my experience (if you've got a reliable network),
gives better performance (there's less protocol overhead than NBD, and it
runs on top of layer 2 instead of layer 4)


I've tested it -- not on the Odroid-U2 but on Pine64 (fully working GbE).
NBD delivers 108MB/sec in a linear transfer, ATAoE is lucky to break
40MB/sec, same target (Qnap-253a, spinning rust), both in default
configuration without further tuning.  NBD is over IPv6 for that extra 20
bytes per packet overhead.

Interesting, I've seen the exact opposite in terms of performance.


Also, NBD can be encrypted or arbitrarily routed.

Yes, though if you're on a local network, neither should matter :).



It's a single-device filesystem, thus disconnects are obviously fatal.  But,
they never caused even a single bit of damage (as scrub goes), thus proving
btrfs handles this kind of disconnects well.  Unlike times past, the kernel
doesn't get confused thus no reboot is needed, merely an unmount, "service
nbd-client restart", mount, restart the rebuild jobs.

That's expected behavior though.  _Single_ device BTRFS has nothing to get
out of sync most of the time, the only time there's any possibility of an
issue is when you die after writing the first copy of a block that's in a
dup profile chunk, but even that is not very likely to cause problems
(you'll just lose at most the last  worth of data).


How come?  In a DUP profile, the writes are: chunk 1, chunk2, barrier,
superblock.  The two prior writes may be arbitrarily reordered -- both
between each other or even individual sectors inside the chunks, but unless
the disk lies about barriers, there's no way to have any corruption, thus
running scrub is not needed.
If the device dies after writing chunk 1 but before the barrier, you end 
up needing scrub.  How much of a failure window is present is largely a 
function of how fast the device is, but there is a failure window there.



The moment you add another device though, that simplicity goes out the
window.


RAID1 doesn't seem less simple to me: if the new superblock has been
successfully written on at least one disk, barriers imply 

Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 17 Oct 2017 15:19:09 -0400 as
excerpted:

>> It's a single-device filesystem, thus disconnects are obviously fatal. 
>> But,
>> they never caused even a single bit of damage (as scrub goes), thus
>> proving btrfs handles this kind of disconnects well.  Unlike times
>> past, the kernel doesn't get confused thus no reboot is needed, merely
>> an unmount, "service nbd-client restart", mount, restart the rebuild
>> jobs.
> That's expected behavior though.  _Single_ device BTRFS has nothing to
> get out of sync most of the time, the only time there's any possibility
> of an issue is when you die after writing the first copy of a block
> that's in a dup profile chunk, but even that is not very likely to cause
> problems (you'll just lose at most the last  worth of
> data).  The moment you add another device though, that simplicity goes
> out the window.

This is why I said if you /must/ work with USB-connected devices, I'd 
suggest single-device btrfs, tho I'd go full dup (data too) to take 
advantage of the btrfs ability to detect checksum errors and rewrite a 
bad copy from the second, hopefully good, copy.

Single device simply doesn't have the sync issues of multi-device, even 
in dup mode, so ends up being much more reliable in cases like USB 
connection (and NBD as in the example above), where connection 
reliability is a problem.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Duncan
Zoltán Ivánfi posted on Tue, 17 Oct 2017 23:56:56 +0200 as excerpted:

> I understand that due to the lack of a write intent bitmap, btrfs does
> not know what needs to be resynced exactly, therefore syncing the two
> copies of a raid1 can not happen as fast in btrfs as in md. However, I
> don't understand why converting from raid1 to single is so slow. After
> all, btrfs would just have to discard one copy of each chunk. Still, in
> practice, converting from raid1 to single is just as slow as converting
> from single to raid1.

Remember, each chunk "knows" its own type, and btrfs is COW, so changing 
the type of the remaining chunk requires rewriting it.  Thus, conversion 
from raid1 to single requires rewriting all chunks of the source type.  
Simply discarding the "extra" raid1 chunk won't cut it, because it'd 
still be raid1, not single.
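
A rough sketch of why (purely illustrative; real chunks carry far more
state than this): the remaining copy's own type field still says raid1,
so conversion has to allocate new single-profile chunks and copy the data
into them, which is roughly the same amount of I/O as converting in the
other direction.

from dataclasses import dataclass

@dataclass
class Chunk:
    profile: str     # "raid1" or "single"
    data: bytes

def convert_to_single(chunks):
    # Dropping one mirror is not enough: the surviving chunk would still be
    # typed raid1.  Being CoW, the conversion writes a new single chunk and
    # migrates the data, chunk by chunk.
    return [Chunk("single", c.data) for c in chunks]

fs = [Chunk("raid1", b"extents 0-127"), Chunk("raid1", b"extents 128-255")]
fs = convert_to_single(fs)            # every chunk gets rewritten, hence the time
print([c.profile for c in fs])        # ['single', 'single']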

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Zoltán Ivánfi
Hi,

I understand that due to the lack of a write intent bitmap, btrfs does
not know what needs to be resynced exactly, therefore syncing the two
copies of a raid1 can not happen as fast in btrfs as in md. However, I
don't understand why converting from raid1 to single is so slow. After
all, btrfs would just have to discard one copy of each chunk. Still,
in practice, converting from raid1 to single is just as slow as
converting from single to raid1.

One more thing that is not clear to me is the replication profile of a
volume. I see that balance can convert chunks between profiles, for
example from single to raid1, but I don't see how the default profile
for new chunks can be set or queried. After all, the convert filter
changes chunk profiles one by one, and it can be cancelled in the
middle or filtered to a set of chunks only, so the profiles of
different chunks may differ from each other. I don't know whether this
is considered healthy in the long run, but at least during the
conversion it is the expected state for a while. Yet I do not see any
command mentioned anywhere to set or query the default profile for new
chunks, only commands to convert existing chunks, which magically also
seem to set the default profile of new chunks at some point during the
conversion (probably at the end, because once after a convert I had to
convert again since a new chunk got added during the first conversion
with the old profile).

Thanks,

Zoltan

On Tue, Oct 17, 2017 at 10:21 PM, Adam Borowski  wrote:
> On Tue, Oct 17, 2017 at 03:19:09PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-10-17 13:06, Adam Borowski wrote:
>> > On Tue, Oct 17, 2017 at 08:40:20AM -0400, Austin S. Hemmelgarn wrote:
>> > > On 2017-10-17 07:42, Zoltan wrote:
>> > > > On Tue, Oct 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
>> > > >  wrote:
>> > > >
>> > > > > I forget sometimes that people insist on storing large volumes of 
>> > > > > data on
>> > > > > unreliable storage...
>> > > >
>> > > > In my opinion the unreliability of the storage is the exact reason for
>> > > > wanting to use raid1. And I think any problem one encounters with an
>> > > > unreliable disk can likely happen with more reliable ones as well,
>> > > > only less frequently, so if I don't feel comfortable using raid1 on an
>> > > > unreliable medium then I wouldn't trust it on a more reliable one
>> > > > either.
>> >
>> > > The thing is that you need some minimum degree of reliability in the 
>> > > other
>> > > components in the storage stack for it to be viable to use any given 
>> > > storage
>> > > technology.  If you don't meet that minimum degree of reliability, then 
>> > > you
>> > > can't count on the reliability guarantees of the storage technology.
>> >
>> > The thing is, reliability guarantees required vary WILDLY depending on your
>> > particular use cases.  On one hand, there's "even an one-minute downtime
>> > would cost us mucho $$$s, can't have that!" -- on the other, "it died?
>> > Okay, we got backups, lemme restore it after the weekend".
>> Yes, but if you are in the second case, you arguably don't need replication,
>> and would be better served by improving the reliability of your underlying
>> storage stack than trying to work around its problems. Even in that case,
>> your overall reliability is still constrained by the least reliable
>> component (in more idiomatic terms 'a chain is only as strong as its
>> weakest link').
>
> MD can handle this case well, there's no reason btrfs shouldn't do that too.
> A RAID is not akin to serially connected chain, it's a parallel connected
> chain: while pieces of the broken second chain hanging down from the first
> don't make it strictly more resilient than having just a single chain, in
> general case it _is_ more reliable even if the other chain is weaker.
>
> Don't we have a patchset that deals with marking a device as failed at
> runtime floating on the mailing list?  I did not look at those patches yet,
> but they are a step in this direction.
>
>> Using replication with a reliable device and a questionable device is
>> essentially the same as trying to add redundancy to a machine by adding an
>> extra linkage that doesn't always work and can get in the way of the main
>> linkage it's supposed to be protecting from failure.  Yes, it will work most
>> of the time, but the system is going to be less reliable than it is without
>> the 'redundancy'.
>
> That's the current state of btrfs, but the design is sound, and reaching
> more than parity with MD is a matter of implementation.
>
>> > Thus, I switched the machine to NBD (albeit it sucks on 100Mbit eth).  
>> > Alas,
>> > the network driver allocates memory with GFP_NOIO which causes NBD
>> > disconnects (somehow, this doesn't ever happen on swap where GFP_NOIO would
>> > be obvious but on regular filesystem where throwing out userspace memory is
>> > safe).  The disconnects happen around once per week.
>> Somewhat 

Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Adam Borowski
On Tue, Oct 17, 2017 at 03:19:09PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-17 13:06, Adam Borowski wrote:
> > On Tue, Oct 17, 2017 at 08:40:20AM -0400, Austin S. Hemmelgarn wrote:
> > > On 2017-10-17 07:42, Zoltan wrote:
> > > > On Tue, Oct 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
> > > >  wrote:
> > > > 
> > > > > I forget sometimes that people insist on storing large volumes of 
> > > > > data on
> > > > > unreliable storage...
> > > > 
> > > > In my opinion the unreliability of the storage is the exact reason for
> > > > wanting to use raid1. And I think any problem one encounters with an
> > > > unreliable disk can likely happen with more reliable ones as well,
> > > > only less frequently, so if I don't feel comfortable using raid1 on an
> > > > unreliable medium then I wouldn't trust it on a more reliable one
> > > > either.
> > 
> > > The thing is that you need some minimum degree of reliability in the other
> > > components in the storage stack for it to be viable to use any given 
> > > storage
> > > technology.  If you don't meet that minimum degree of reliability, then 
> > > you
> > > can't count on the reliability guarantees of the storage technology.
> > 
> > The thing is, reliability guarantees required vary WILDLY depending on your
> particular use cases.  On one hand, there's "even a one-minute downtime
> > would cost us mucho $$$s, can't have that!" -- on the other, "it died?
> > Okay, we got backups, lemme restore it after the weekend".
> Yes, but if you are in the second case, you arguably don't need replication,
> and would be better served by improving the reliability of your underlying
> storage stack than trying to work around its problems. Even in that case,
> your overall reliability is still constrained by the least reliable
> component (in more idiomatic terms 'a chain is only as strong as its
> weakest link').

MD can handle this case well, there's no reason btrfs shouldn't do that too.
A RAID is not akin to a serially connected chain, it's a parallel connected
chain: while pieces of the broken second chain hanging down from the first
don't make it strictly more resilient than having just a single chain, in
the general case it _is_ more reliable even if the other chain is weaker.

Don't we have a patchset that deals with marking a device as failed at
runtime floating on the mailing list?  I did not look at those patches yet,
but they are a step in this direction.

> Using replication with a reliable device and a questionable device is
> essentially the same as trying to add redundancy to a machine by adding an
> extra linkage that doesn't always work and can get in the way of the main
> linkage it's supposed to be protecting from failure.  Yes, it will work most
> of the time, but the system is going to be less reliable than it is without
> the 'redundancy'.

That's the current state of btrfs, but the design is sound, and reaching
more than parity with MD is a matter of implementation.

> > Thus, I switched the machine to NBD (albeit it sucks on 100Mbit eth).  Alas,
> > the network driver allocates memory with GFP_NOIO which causes NBD
> > disconnects (somehow, this doesn't ever happen on swap where GFP_NOIO would
> > be obvious but on regular filesystem where throwing out userspace memory is
> > safe).  The disconnects happen around once per week.
> Somewhat off-topic, but you might try looking at ATAoE as an alternative,
> it's more reliable in my experience (if you've got a reliable network),
> gives better performance (there's less protocol overhead than NBD, and it
> runs on top of layer 2 instead of layer 4)

I've tested it -- not on the Odroid-U2 but on Pine64 (fully working GbE). 
NBD delivers 108MB/sec in a linear transfer, ATAoE is lucky to break
40MB/sec, same target (Qnap-253a, spinning rust), both in default
configuration without further tuning.  NBD is over IPv6 for that extra 20
bytes per packet overhead.

Also, NBD can be encrypted or arbitrarily routed.
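
For example, a rough sketch of pushing it through an SSH tunnel (host name,
export name and device paths are made up, and nbd-client option syntax
differs a bit between versions):

  ssh -f -N -L 10809:localhost:10809 user@nas.example   # carry the NBD port inside the tunnel
  nbd-client -N scratch localhost /dev/nbd0             # attach the tunnelled export
  mount /dev/nbd0 /mnt/build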

> > It's a single-device filesystem, thus disconnects are obviously fatal.  But,
> > they never caused even a single bit of damage (as scrub goes), thus proving
> > btrfs handles this kind of disconnects well.  Unlike times past, the kernel
> > doesn't get confused thus no reboot is needed, merely an unmount, "service
> > nbd-client restart", mount, restart the rebuild jobs.
> That's expected behavior though.  _Single_ device BTRFS has nothing to get
> out of sync most of the time, the only time there's any possibility of an
> issue is when you die after writing the first copy of a block that's in a
> dup profile chunk, but even that is not very likely to cause problems
> (you'll just lose at most the last  worth of data).

How come?  In a DUP profile, the writes are: chunk 1, chunk2, barrier,
superblock.  The two prior writes may be arbitrarily reordered -- both
between each other or even individual sectors inside the chunks, but unless
the disk lies about barriers, there's no 

Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Austin S. Hemmelgarn

On 2017-10-17 13:06, Adam Borowski wrote:

On Tue, Oct 17, 2017 at 08:40:20AM -0400, Austin S. Hemmelgarn wrote:

On 2017-10-17 07:42, Zoltan wrote:

On Tue, Oct 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
 wrote:


I forget sometimes that people insist on storing large volumes of data on
unreliable storage...


In my opinion the unreliability of the storage is the exact reason for
wanting to use raid1. And I think any problem one encounters with an
unreliable disk can likely happen with more reliable ones as well,
only less frequently, so if I don't feel comfortable using raid1 on an
unreliable medium then I wouldn't trust it on a more reliable one
either.



The thing is that you need some minimum degree of reliability in the other
components in the storage stack for it to be viable to use any given storage
technology.  If you don't meet that minimum degree of reliability, then you
can't count on the reliability guarantees of the storage technology.


The thing is, reliability guarantees required vary WILDLY depending on your
particular use cases.  On one hand, there's "even a one-minute downtime
would cost us mucho $$$s, can't have that!" -- on the other, "it died?
Okay, we got backups, lemme restore it after the weekend".
Yes, but if you are in the second case, you arguably don't need 
replication, and would be better served by improving the reliability of 
your underlying storage stack than trying to work around its problems. 
Even in that case, your overall reliability is still constrained by the 
least reliable component (in more idiomatic terms 'a chain is only as 
strong as its weakest link').


Using replication with a reliable device and a questionable device is 
essentially the same as trying to add redundancy to a machine by adding 
an extra linkage that doesn't always work and can get in the way of the 
main linkage it's supposed to be protecting from failure.  Yes, it will 
work most of the time, but the system is going to be less reliable than 
it is without the 'redundancy'.


Lemme tell you a btrfs blockdev disconnects story.
I have an Odroid-U2, a cheap ARM SoC that, despite being 5 years old and
costing mere $79 (+$89 eMMC...) still beats the performance of much newer
SoCs that have far better theoretical specs, including subsequent Odroids.
After ~1.5 year of CPU-bound stress tests for one program, I switched this
machine to doing Debian package rebuilds, 24/7/365¼, for QA purposes.
Being a moron, I did not realize until pretty late that high parallelism to
keep all cores utilized is still a net performance loss when a memory-hungry
package goes into a swappeathon, even despite the latter being fairly rare.
Thus, I can say disk utilization was pretty much 100%, with almost as much
writing as reading.  The eMMC card endured all of this until very recently
(nowadays it sadly throws errors from time to time).

Thus, I switched the machine to NBD (albeit it sucks on 100Mbit eth).  Alas,
the network driver allocates memory with GFP_NOIO which causes NBD
disconnects (somehow, this doesn't ever happen on swap where GFP_NOIO would
be obvious but on regular filesystem where throwing out userspace memory is
safe).  The disconnects happen around once per week.
Somewhat off-topic, but you might try looking at ATAoE as an 
alternative, it's more reliable in my experience (if you've got a 
reliable network), gives better performance (there's less protocol 
overhead than NBD, and it runs on top of layer 2 instead of layer 4), 
and you can even boot with an ATAoE device as root without needing an 
initramfs if you have network auto-configuration in the kernel.  The 
generic server-side component is called 'vblade', and you actually don't 
need anything on the client side other than the `aoe` kernel module 
(loading the module scans for devices automatically, and you can easily 
manage things through the various nodes it creates in /dev).
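
A rough sketch of a minimal setup (interface, shelf/slot numbers, and device
names are purely illustrative):

  # on the server: export /dev/sdb as AoE shelf 0, slot 1 on eth0
  vblade 0 1 eth0 /dev/sdb
  # on the client: load the driver; exported targets are discovered automatically
  modprobe aoe
  # the target then shows up as /dev/etherd/e0.1 and behaves like any block device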


It's a single-device filesystem, thus disconnects are obviously fatal.  But,
they never caused even a single bit of damage (as scrub goes), thus proving
btrfs handles this kind of disconnects well.  Unlike times past, the kernel
doesn't get confused thus no reboot is needed, merely an unmount, "service
nbd-client restart", mount, restart the rebuild jobs.
That's expected behavior though.  _Single_ device BTRFS has nothing to 
get out of sync most of the time, the only time there's any possibility 
of an issue is when you die after writing the first copy of a block 
that's in a dup profile chunk, but even that is not very likely to cause 
problems (you'll just lose at most the last  worth of 
data).  The moment you add another device though, that simplicity goes 
out the window.


I also can recreate this filesystem and the build environment on it with
just a few commands, thus, unlike /, there's no need for backups.  But I
had no need to recreate it yet.

This is single-device, not RAID5, but it's a good example for a use case
where an 

Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Adam Borowski
On Tue, Oct 17, 2017 at 08:40:20AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-17 07:42, Zoltan wrote:
> > On Tue, Oct 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
> >  wrote:
> > 
> > > I forget sometimes that people insist on storing large volumes of data on
> > > unreliable storage...
> > 
> > In my opinion the unreliability of the storage is the exact reason for
> > wanting to use raid1. And I think any problem one encounters with an
> > unreliable disk can likely happen with more reliable ones as well,
> > only less frequently, so if I don't feel comfortable using raid1 on an
> > unreliable medium then I wouldn't trust it on a more reliable one
> > either.

> The thing is that you need some minimum degree of reliability in the other
> components in the storage stack for it to be viable to use any given storage
> technology.  If you don't meet that minimum degree of reliability, then you
> can't count on the reliability guarantees of the storage technology.

The thing is, reliability guarantees required vary WILDLY depending on your
particular use cases.  On one hand, there's "even a one-minute downtime
would cost us mucho $$$s, can't have that!" -- on the other, "it died? 
Okay, we got backups, lemme restore it after the weekend".

Lemme tell you a btrfs blockdev disconnects story.
I have an Odroid-U2, a cheap ARM SoC that, despite being 5 years old and
costing mere $79 (+$89 eMMC...) still beats the performance of much newer
SoCs that have far better theoretical specs, including subsequent Odroids.
After ~1.5 year of CPU-bound stress tests for one program, I switched this
machine to doing Debian package rebuilds, 24/7/365¼, for QA purposes.
Being a moron, I did not realize until pretty late that high parallelism to
keep all cores utilized is still a net performance loss when a memory-hungry
package goes into a swappeathon, even despite the latter being fairly rare.
Thus, I can say disk utilization was pretty much 100%, with almost as much
writing as reading.  The eMMC card endured all of this until very recently
(nowadays it sadly throws errors from time to time).

Thus, I switched the machine to NBD (albeit it sucks on 100Mbit eth).  Alas,
the network driver allocates memory with GFP_NOIO which causes NBD
disconnects (somehow, this doesn't ever happen on swap where GFP_NOIO would
be obvious but on regular filesystem where throwing out userspace memory is
safe).  The disconnects happen around once per week.

It's a single-device filesystem, thus disconnects are obviously fatal.  But,
they never caused even a single bit of damage (as scrub goes), thus proving
btrfs handles this kind of disconnects well.  Unlike times past, the kernel
doesn't get confused thus no reboot is needed, merely an unmount, "service
nbd-client restart", mount, restart the rebuild jobs.

I also can recreate this filesystem and the build environment on it with
just a few commands, thus, unlike /, there's no need for backups.  But I
had no need to recreate it yet.

This is single-device, not RAID5, but it's a good example for a use case
where an unreliable storage medium is acceptable (even if the GFP_NOIO issue
is still worth fixing).


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄ sky.  Your cat demands food.  The priority should be obvious...


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Austin S. Hemmelgarn

On 2017-10-17 07:42, Zoltan wrote:

Hi,

On Tue, Oct 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
 wrote:


I forget sometimes that people insist on storing large volumes of data on
unreliable storage...


In my opinion the unreliability of the storage is the exact reason for
wanting to use raid1. And I think any problem one encounters with an
unreliable disk can likely happen with more reliable ones as well,
only less frequently, so if I don't feel comfortable using raid1 on an
unreliable medium then I wouldn't trust it on a more reliable one
either.
The thing is that you need some minimum degree of reliability in the 
other components in the storage stack for it to be viable to use any 
given storage technology.  If you don't meet that minimum degree of 
reliability, then you can't count on the reliability guarantees of the 
storage technology.


As a more concrete example, reliable operation of a RAID5 array requires 
a functional guarantee that a second disk won't fail during a rebuild of 
the array.  That fact gives an upper limit on the practical size of a 
RAID5 array as a function of the failure rate of the disks and the time 
it takes to rebuild the array.   More specifically, the array has to be 
small enough that the chance of a second device failing during a rebuild 
is statistically negligible.  Once the failure rate goes 
above a certain value proportionate to the rebuild time, you end up in a 
situation where you shouldn't be using RAID5 to protect against device 
failure because it won't provide any net benefit (and FWIW, we're past 
that point on most modern hardware when talking about typical enterprise 
scale storage requirements).
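
As a purely illustrative back-of-the-envelope figure (the "second failure"
during a rebuild usually shows up as an unrecoverable read error rather than
a whole-disk death): rebuilding a RAID5 array of eight 8TB disks means
reading about 56TB, roughly 4.5 x 10^14 bits, from the surviving members.
At a consumer-class unrecoverable read error rate of 1 per 10^14 bits that
is an expected ~4.5 read errors per rebuild, so hitting at least one is
practically certain; even at an enterprise-class 1 per 10^15 bits you still
expect ~0.45 errors per rebuild.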


As of right now, a rate of device disconnects above 0 invalidates that 
reliability requirement for multi-device BTRFS (and it doesn't have to 
be much above zero to invalidate it for MD and LVM, it just needs to be 
high enough to make it statistically possible that it happens more than 
once during a rebuild), and thus USB is not something that should be 
considered a viable option for hosting a multi-device BTRFS array in 
most cases.



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Zoltan
Hi,

On Tue, Oct 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
 wrote:

> I forget sometimes that people insist on storing large volumes of data on
> unreliable storage...

In my opinion the unreliability of the storage is the exact reason for
wanting to use raid1. And I think any problem one encounters with an
unreliable disk can likely happen with more reliable ones as well,
only less frequently, so if I don't feel comfortable using raid1 on an
unreliable medium then I wouldn't trust it on a more reliable one
either.

Zoltan


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Austin S. Hemmelgarn

On 2017-10-16 21:14, Adam Borowski wrote:

On Mon, Oct 16, 2017 at 01:27:40PM -0400, Austin S. Hemmelgarn wrote:

On 2017-10-16 12:57, Zoltan wrote:

On Mon, Oct 16, 2017 at 1:53 PM, Austin S. Hemmelgarn wrote:

In an ideal situation, scrubbing should not be an 'only if needed' thing,
even for a regular array that isn't dealing with USB issues. From a
practical perspective, there's no way to know for certain if a scrub is
needed short of reading every single file in the filesystem in its
entirety, at which point, you're just better off running a scrub (because if
you _do_ need to scrub, you'll end up reading everything twice).



[...]  There are three things to deal with here:
1. Latent data corruption caused either by bit rot, or by a half-write (that
is, one copy got written successfully, then the other device disappeared
_before_ the other copy got written).
2. Single chunks generated when the array is degraded.
3. Half-raid1 chunks generated by newer kernels when the array is degraded.


Note that any of the above other than bit rot affect only very recent data.
If we keep a record of the last known-good generation, all of that can be
enumerated, allowing us to make a selective scrub that checks only a small
part of the disk.  A linear read of an 8TB disk takes 14 hours...

If we ever get auto-recovery, this is a fine candidate.
Indeed, and in fact I think that generational filtering may in fact be 
one of the easier performance improvements here too.
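
As a rough illustration that the generation bookkeeping is already exposed
(path and generation number are made up), the existing find-new command can
already enumerate what changed since a given generation, which is more or
less the information a selective scrub would need:

  btrfs subvolume find-new /mnt/subvol 123456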



Scrub will fix problem 1 because that's what it's designed to fix.  It will
also fix problem 3, since that behaves just like problem 1 from a
higher-level perspective.  It won't fix problem 2 though, as it doesn't look
at chunk types (only if the data in the chunk doesn't have the correct
number of valid copies).


Here not even tracking generations is required: a soft convert balance
touches only bad chunks.  Again, would work well for auto-recovery, as it's
a no-op if all is well.
However, it would require some minor differences from the current 
balance command, as newer kernels (are supposed to) generate half-raid1 
chunks instead of single chunks, though that can also be fixed by scrub.



In contrast, the balance command you quoted won't fix issue 1 (because it
doesn't validate checksums or check that data has the right number of
copies), or issue 3 (because it's been told to only operate on non-raid1
chunks), but it will fix issue 2.

In comparison to both of the above, a full balance without filters will fix
all three issues, although it will do so less efficiently (in terms of both
time and disk usage) than running a soft-conversion balance followed by a
scrub.


"less efficiently" is an understatement.  Scrub gets a good part of
theoretical linear speed, while I just had a single metadata block take
14428 seconds to balance.

Yeah, the metadata especially can get pretty bad.



In the case of normal usage, device disconnects are rare, so you should
generally be more worried about latent data corruption.


Yeah, but certain setups (like anything USB) get disconnects quite often.
It would be nice to get them right.  MD thanks to write-intent bitmap can
recover almost instantly, btrfs could do it better -- the code to do so
isn't written yet.
The write-intent bitmap is also exponentially easier to implement than 
what would be needed for BTRFS.
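
For comparison, on MD it's a single command to add one to an existing array
(array name illustrative), after which a briefly-absent member only resyncs
the regions marked dirty in the bitmap:

  mdadm --grow --bitmap=internal /dev/md0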



monitor the kernel log to watch for device disconnects, remount the
filesystem when the device reconnects, and then run the balance command
followed by a scrub.  With most hardware I've seen, USB disconnects tend to
be relatively frequent unless you're using very high quality cabling and
peripheral devices.  If, however, they happen less than once a day most of
the time, just set up the log monitor to remount, and set the balance and
scrub commands on the schedule I suggested above for normal usage.


A day-long recovery for an event that happens daily isn't a particularly
enticing prospect.
I forget sometimes that people insist on storing large volumes of data 
on unreliable storage...



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-16 Thread Adam Borowski
On Mon, Oct 16, 2017 at 01:27:40PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-16 12:57, Zoltan wrote:
> > On Mon, Oct 16, 2017 at 1:53 PM, Austin S. Hemmelgarn wrote:
> In an ideal situation, scrubbing should not be an 'only if needed' thing,
> even for a regular array that isn't dealing with USB issues. From a
> practical perspective, there's no way to know for certain if a scrub is
> needed short of reading every single file in the filesystem in its
> entirety, at which point, you're just better off running a scrub (because if
> you _do_ need to scrub, you'll end up reading everything twice).

> [...]  There are three things to deal with here:
> 1. Latent data corruption caused either by bit rot, or by a half-write (that
> is, one copy got written successfully, then the other device disappeared
> _before_ the other copy got written).
> 2. Single chunks generated when the array is degraded.
> 3. Half-raid1 chunks generated by newer kernels when the array is degraded.

Note that any of the above other than bit rot affect only very recent data. 
If we keep a record of the last known-good generation, all of that can be
enumerated, allowing us to make a selective scrub that checks only a small
part of the disk.  A linear read of an 8TB disk takes 14 hours...

If we ever get auto-recovery, this is a fine candidate.

> Scrub will fix problem 1 because that's what it's designed to fix.  It will
> also fix problem 3, since that behaves just like problem 1 from a
> higher-level perspective.  It won't fix problem 2 though, as it doesn't look
> at chunk types (only if the data in the chunk doesn't have the correct
> number of valid copies).

Here not even tracking generations is required: a soft convert balance
touches only bad chunks.  Again, would work well for auto-recovery, as it's
a no-op if all is well.

> In contrast, the balance command you quoted won't fix issue 1 (because it
> doesn't validate checksums or check that data has the right number of
> copies), or issue 3 (because it's been told to only operate on non-raid1
> chunks), but it will fix issue 2.
> 
> In comparison to both of the above, a full balance without filters will fix
> all three issues, although it will do so less efficiently (in terms of both
> time and disk usage) than running a soft-conversion balance followed by a
> scrub.

"less efficiently" is an understatement.  Scrub gets a good part of
theoretical linear speed, while I just had a single metadata block take
14428 seconds to balance.

> In the case of normal usage, device disconnects are rare, so you should
> generally be more worried about latent data corruption.

Yeah, but certain setups (like anything USB) get disconnects quite often. 
It would be nice to get them right.  MD thanks to write-intent bitmap can
recover almost instantly, btrfs could do it better -- the code to do so
isn't written yet.

> monitor the kernel log to watch for device disconnects, remount the
> filesystem when the device reconnects, and then run the balance command
> followed by a scrub.  With most hardware I've seen, USB disconnects tend to
> be relatively frequent unless you're using very high quality cabling and
> peripheral devices.  If, however, they happen less than once a day most of
> the time, just set up the log monitor to remount, and set the balance and
> scrub commands on the schedule I suggested above for normal usage.

A day-long recovery for an event that happens daily isn't a particularly
enticing prospect.

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢰⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ I was born a dumb, ugly and work-loving kid, then I got swapped on
⠈⠳⣄ the maternity ward.


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-16 Thread Austin S. Hemmelgarn

On 2017-10-16 12:57, Zoltan wrote:

Hi,

On Mon, Oct 16, 2017 at 1:53 PM, Austin S. Hemmelgarn wrote:


you will need to scrub regularly to avoid data corruption


Is there any indication that a scrub is needed? Before actually doing
a scrub, is btrfs already aware that one of the devices did not
receive all data due to being unavailable for a brief time? If so,
which command shows this info in its output?
In an ideal situation, scrubbing should not be an 'only if needed' 
thing, even for a regular array that isn't dealing with USB issues. 
From a practical perspective, there's no way to know for certain if a 
scrub is needed short of reading every single file in the filesystem in 
its entirety, at which point, you're just better off running a scrub 
(because if you _do_ need to scrub, you'll end up reading everything twice).


If you insist on spot-checking things, you can check the output of 
`btrfs device stats` for the filesystem.  If any numbers there are 
non-zero, then some file that you've accessed _since the last time you 
reset the counters_ has corruption.  If you go this route, make sure to 
reset the counters with `btrfs device stats -z` _immediately_ after you 
run a scrub, or in some way track their values externally to compare 
against.
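
In other words, something along these lines (mountpoint illustrative):

  btrfs scrub start -B /mnt    # -B waits for the scrub to finish
  btrfs device stats -z /mnt   # reset the counters right after the clean scrub
  # ...later...
  btrfs device stats /mnt      # any non-zero counter now means new errors since that scrub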


Additionally, how does btrfs scrub compare to btrfs balance
-dconvert=raid1,soft -mconvert=raid1,soft in this scenario? I would
suppose that if btrfs is aware that some data does not have a
replication count of 2, then a convert could fix that without a scrub
reading through the whole disk. On the other hand, while I would
expect btrfs scrub to find data with a bad checksum, I would not expect
it to do a balance as well in order to achieve the desired replication
count of 2 for all data. So do I need to run both a scrub and a
convert, or is a scrub enough?

It kind of depends.  There are three things to deal with here:
1. Latent data corruption caused either by bit rot, or by a half-write 
(that is, one copy got written successfully, then the other device 
disappeared _before_ the other copy got written).

2. Single chunks generated when the array is degraded.
3. Half-raid1 chunks generated by newer kernels when the array is degraded.

Scrub will fix problem 1 because that's what it's designed to fix.  It 
will also fix problem 3, since that behaves just like problem 1 from a 
higher-level perspective.  It won't fix problem 2 though, as it doesn't 
look at chunk types (only if the data in the chunk doesn't have the 
correct number of valid copies).


In contrast, the balance command you quoted won't fix issue 1 (because 
it doesn't validate checksums or check that data has the right number of 
copies), or issue 3 (because it's been told to only operate on non-raid1 
chunks), but it will fix issue 2.


In comparison to both of the above, a full balance without filters will 
fix all three issues, although it will do so less efficiently (in terms 
of both time and disk usage) than running a soft-conversion balance 
followed by a scrub.


In the case of normal usage, device disconnects are rare, so you should 
generally be more worried about latent data corruption.  As a result, 
for most normal users, I would suggest running the balance command you 
gave daily (it will usually finish instantly, so there's no point in not 
running it frequently to help ensure data safety) and a scrub daily or 
weekly (this is the one that matters more here, since you need to worry 
more about latent data corruption).
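
As a rough sketch of that schedule in cron form (mountpoint, times, and the
file name are all illustrative):

  # /etc/cron.d/btrfs-maintenance
  15 3 * * *  root  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
  30 4 * * 0  root  btrfs scrub start -Bd /mnt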


For your use case though, I would instead suggest setting something up 
to monitor the kernel log to watch for device disconnects, remount the 
filesystem when the device reconnects, and then run the balance command 
followed by a scrub.  With most hardware I've seen, USB disconnects tend 
to be relatively frequent unless you're using very high quality cabling 
and peripheral devices.  If, however, they happen less than once a day 
most of the time, just set up the log monitor to remount, and set the 
balance and scrub commands on the schedule I suggested above for normal 
usage.
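
A very rough sketch of such a monitor (device name, mountpoint, and the
exact log message to match are all illustrative, and error handling is
omitted):

  journalctl -kf | while read -r line; do
      case "$line" in
      *"USB disconnect"*)
          umount /mnt 2>/dev/null
          until [ -b /dev/sdb ]; do sleep 5; done   # wait for the device to come back
          mount /dev/sdb /mnt
          btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
          btrfs scrub start -B /mnt
          ;;
      esac
  done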



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-16 Thread Zoltan
Hi,

On Mon, Oct 16, 2017 at 1:53 PM, Austin S. Hemmelgarn wrote:

> you will need to scrub regularly to avoid data corruption

Is there any indication that a scrub is needed? Before actually doing
a scrub, is btrfs already aware that one of the devices did not
receive all data due to being unavailable for a brief time? If so,
which command shows this info in its output?

Additionally, how does btrfs scrub compare to btrfs balance
-dconvert=raid1,soft -mconvert=raid1,soft in this scenario? I would
suppose that if btrfs is aware that some data does not have a
replication count of 2, then a convert could fix that without a scrub
reading through the whole disk. On the other hand, while I would
expect btrfs scrub to find data with a bad checksum, I would not expect
it to do a balance as well in order to achieve the desired replication
count of 2 for all data. So do I need to run both a scrub and a
convert, or is a scrub enough?

Thanks,

Zoltan


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-16 Thread Austin S. Hemmelgarn

On 2017-10-15 04:30, Zoltán Ivánfi wrote:

Hi,

Thanks for the replies.

As you both pointed out, I shouldn't have described the issue as having
to do with hotplugging. I got confused by this use-case being somewhat
emphasized in the description of the bug I linked to. As for the
question of why I think that I got bitten by that bug in particular:
It matches my experiences as I can recall them (used RAID on SATA+USB,
got bio too big device error messages, data got corrupted).
The actual issue isn't exactly just USB, but externally connected 
hot-pluggable devices on unreliable buses, which mostly consists of USB, 
but also includes things like IEEE 1394 and eSATA (although eSATA isn't 
always an issue, it depends on the cabling and the physical connections).


You assumed correctly that what I really wanted to ask about was btrfs
on SATA+USB, thanks for answering that question as well. Based on
your replies I feel assured that btrfs should not be affected by this
particular issue due to operating on the filesystem level and not on
the block device level; but USB connectivity issues can still lead to
problems.
Indeed, and I would in fact personally recommend against using USB for 
any kind of always connected persistent storage if at all possible. 
Even if the USB controller, storage controller, and cabling are 100% 
reliable, you still run a pretty significant risk of issues just from 
bumping the system because USB-A and USB-C connectors do not provide 
particularly solid connections from a mechanical perspective.


Do these USB connectivity issues lead to data corruption? Naturally
for raid0 they will, but for raid1 I suppose they shouldn't as one
copy of the data remains intact.
For raid1, they _normally_ won't cause data corruption with BTRFS, 
provided you ensure proper maintenance of the filesystem.  They will 
however cause performance issues, even if you're using USB 3.0 or 3.1, 
as you will end up re-writing bits and pieces of the filesystem 
regularly (because you will need to scrub regularly to avoid data 
corruption).




Re: Is it safe to use btrfs on top of different types of devices?

2017-10-15 Thread Duncan
Zoltán Ivánfi posted on Sun, 15 Oct 2017 10:30:52 +0200 as excerpted:

> You assumed correctly that what I really wanted to ask about was btrfs
> on SATA+USB, thanks for answering that question as well. Based on your
> replies I feel assured that btrfs should not be affected by this
> particular issue due to operating on the filesystem level and not on the
> block device level; but USB connectivity issues can still lead to
> problems.
> 
> Do these USB connectivity issues lead to data corruption? Naturally for
> raid0 they will, but for raid1 I suppose they shouldn't as one copy of
> the data remains intact.

FWIW, the problems I've /personally/ had with raid1 here, on both mdraid 
and btrfs raid, 100% SATA connections on both, with older SATA-1 spinning 
rust for the mdraid, and newer SSDs for btrfs raid, have to do with 
suspend to RAM, or hibernate (suspend to disk).  I finally gave up on 
it.  The problem is that in resume, one device is inevitably slower than 
the others to come back up, and will often get kicked from the array.

On original bootup there's a mechanism that waits for all devices, but 
apparently it's not activated on resume from suspend-to-RAM.  With mdraid 
(longer ago, with a machine that would hibernate but not suspend to ram) 
the device can be readded and will resync.  Btrfs (as of a couple years 
ago anyway, with a machine that would suspend to RAM but I've not 
actually tried hibernate as I've not configured a swap partition to 
suspend to) remains a bit behind in that area, however, and the slow 
device remains unsynced, eventually forcing a full reboot, where it comes 
up with the raid, but still must be manually synced via a btrfs scrub.  
Since I end up having to reboot and do a manual sync anyway, it's simply 
not worth doing suspend-to-ram in the first place, and I've started just 
shutting down or leaving the machine running.  That seems to work much 
better for both cases.

I've not personally tried raid1 (of either btrfs or mdraid) on USB, so I 
have no personal experience there, but as I said, we do get more reports 
of problems with USB-connected btrfs raid, than with SATA.  Most of the 
problems are fixable, and the reports have lessened as btrfs has matured, 
but I'd not recommend it or consider it worth the hassle.

What I'd recommend instead, if USB connectivity is all that's available 
(as with many appliance-type machines, my router, for instance, tho I'm 
not actually using the feature there), is larger capacity, then use btrfs 
in dup mode so it gets to use btrfs checksumming not just for error 
detection, but correction as well (a big advantage of both raid1 and dup 
mode), and do actual backups to other devices.  (Btrfs send/receive can 
be used for the backups, tho here I just alternate backups and use a 
simpler mkfs and midnight-commander copying with btrfs lzo compression.)
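
A rough sketch of that arrangement (device names and paths are illustrative;
-d dup on a single device wants reasonably recent btrfs-progs):

  mkfs.btrfs -m dup -d dup /dev/sdb1
  # backups still go to a separate device, e.g. via a read-only snapshot:
  btrfs subvolume snapshot -r /mnt/data /mnt/data/.snap
  btrfs send /mnt/data/.snap | btrfs receive /mnt/backup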

I tend to heavily partition and use smaller, independent btrfs anyway, 
over the huge multi-TB single btrfs that other people seem to favor, so a 
4 TB single device in dup mode for 2 TB capacity is larger by an order of 
magnitude than any of my filesystems (I'd certainly partition up anything 
that big, even for dup mode), tho I can imagine a 4 TB device in dup mode 
for 2 TB capacity would cramp the style of some users.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-15 Thread Zoltán Ivánfi
Hi,

Thanks for the replies.

As you both pointed out, I shouldn't have described the issue as having
to do with hotplugging. I got confused by this use-case being somewhat
emphasized in the description of the bug I linked to. As for the
question of why I think that I got bitten by that bug in particular:
It matches my experiences as I can recall them (used RAID on SATA+USB,
got bio too big device error messages, data got corrupted).

You assumed correctly that what I really wanted to ask about was btrfs
on SATA+USB, thanks for answering that question as well. Based on
your replies I feel assured that btrfs should not be affected by this
particular issue due to operating on the filesystem level and not on
the block device level; but USB connectivity issues can still lead to
problems.

Do these USB connectivity issues lead to data corruption? Naturally
for raid0 they will, but for raid1 I suppose they shouldn't as one
copy of the data remains intact.

Thanks,

Zoltan

On Sat, Oct 14, 2017 at 9:00 PM, Zoltán Ivánfi  wrote:
> Dear Btrfs Experts,
>
> A few years ago I tried to use a RAID1 mdadm array of a SATA and a USB
> disk, which led to strange error messages and data corruption. I did
> some searching back then and found out that using hot-pluggable
> devices with mdadm is a paved road to data corruption. Reading through
> that old bug again I see that it was autoclosed due to old age but
> still hasn't been addressed:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/320638
>
> I would like to ask whether btrfs may also be prone to data corruption
> issues in this scenario (due to the same underlying issue as the one
> described in the bug above for mdadm), or is btrfs unaffected by the
> underlying issue and is safe to use with a mix of regular and
> hot-pluggable devices as well?
>
> Thanks,
>
> Zoltan


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-14 Thread Duncan
Zoltán Ivánfi posted on Sat, 14 Oct 2017 21:00:26 +0200 as excerpted:

> Dear Btrfs Experts,
> 
> A few years ago I tried to use a RAID1 mdadm array of a SATA and a USB
> disk, which led to strange error messages and data corruption. I did
> some searching back then and found out that using hot-pluggable devices
> with mdadm is a paved road to data corruption. Reading through that old
> bug again I see that it was autoclosed due to old age but still hasn't
> been addressed:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/320638

As pg suggests that bug really doesn't have much to do with hotplug.

Further, SATA is hotplug capable, particularly its ESATA variant (tho 
some specific SATA hardware or drivers may not be, and I believe it's an 
optional feature in the early SATA versions at least), and of course so 
is USB, and SCSI should be as well (tho I know rather less of it), so not 
being able to use hot-pluggable devices with mdadm would make it rather 
useless...

And I take issue with the bug's assertion that "upstream" doesn't have a 
bug tracker as well, as that's an Ubuntu bug, and certainly the mainline 
kernel has a bug tracker, at https://bugzilla.kernel.org , which I've 
been using to report bugs on development kernels for years, since at 
least 2.6.15 in 2006 (the bug above was 2009).  And that definitely was a 
kernelspace bug, tho I'd suggest it wasn't in mdadm so much as in the 
layers on top of it, since by my read mdadm was dealing with and passing 
thru the new due-to-hotplugging values just fine, it was the layers on 
top not dealing with the new values.

Now some kernel projects make more use of the kernel bugzilla than 
others, but there's someone responsible for getting the right people 
involved if necessary, and discussion can and does sometimes move to the 
list after that.  But the bug tracker is certainly there, and certainly 
usable, because as I said I've been using it to report and get fixed bugs 
I've found testing new kernels, since well before that ubuntu bug was 
filed.

And FWIW, btrfs does definitely use the kernel bugzilla, encouraging 
people to file bugs there so they don't drop thru the cracks on the 
mailing list, tho discussion here is useful as well, with links from one 
to the other so people can find both starting with either one.

> I would like to ask whether btrfs may also be prone to data corruption
> issues in this scenario (due to the same underlying issue as the one
> described in the bug above for mdadm), or is btrfs unaffected by the
> underlying issue and is safe to use with a mix of regular and
> hot-pluggable devices as well?

As mentioned, the issue reported there isn't really a hotplug issue...

Meanwhile, I'm not sure about max_sectors changing, but perhaps the 
question you meant to ask was about mixing USB and SATA.

In general, there seems to be a lot of problems reported by people using 
USB for multi-device btrfs connections, due to various issues with USB 
connection reliability.  The problems are often power related, and can be 
traced to anything from trying to drive too much from a limited-current 
port instead of using external power devices, to tripping over the power 
or USB cord and unplugging it in the middle of a write, to performance 
issues due to sharing the USB with other devices, to bad cables or 
otherwise bad connections corrupting the data on its way to the device, 
to...

As a result, I'd suggest, if you can, do SATA or other more robust 
connections for multi-device btrfs, and if you do need to do USB 
connected btrfs, use externally powered devices, and keep to single-
device btrfs.

(Tho a several GiB USB thumb-drive added temporarily can be handy to 
get out of a bind if you're getting ENOSPC on a balance you're trying to 
do to consolidate chunk usage and free unused space back to chunk-
unallocated.  There's a FAQ entry that discusses that.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-14 Thread Peter Grandi
> A few years ago I tried to use a RAID1 mdadm array of a SATA
> and a USB disk, which led to strange error messages and data
> corruption.

That's common, quite a few reports of similar issues in previous
entries in this mailing list and for many other filesystems.

> I did some searching back then and found out that using
> hot-pluggable devices with mdadm is a paved road to data
> corruption.

That's an amazing jump of logic.

> Reading through that old bug again I see that it was
> autoclosed due to old age but still hasn't been addressed:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/320638

I suspect that it is very easy to misinterpret what the reported
issue is. However it is an interesting corner case of what could
happen with any type of hardware device, not just hot-pluggable,
and one that I will try to remember even if unlikely to occur in
practice. I was only aware (dimly) of something quite similar in
the case of different logical sector sizes.

> I would like to ask whether btrfs may also be prone to data
> corruption issues in this scenario

Btrfs like (nearly) all UNIX/Linux filesystems does not run on
top of "devices", but on top of "files" of type "block device".

If the block device abstraction layer and lower layers work
correctly, Btrfs does not have problems of that sort when adding
new devices; conversely if the block device layer and lower
layers do not work correctly, no mainline Linux filesystem I know
can cope with that.

Note: "work correctly" does not mean "work error-free".

> (due to the same underlying issue as the one described in the
> bug above for mdadm), or is btrfs unaffected by the underlying
> issue

"Socratic method" questions:

* What do you think is the underlying issue in that bug report?
  (hint: something to do with host adapters or device bridges)
* Why do you think that bug report is in any way related to your
  issues with "a RAID1 mdadm array of a SATA and a USB disk"?

> and is safe to use with a mix of regular and hot-pluggable
> devices as well?

In my experience Btrfs works very well with a set of block
devices abstracting over both regular and hot-pluggable
devices, as far as that goes.

I personally don't like relying on Btrfs multi-device volumes,
but that has nothing to do with your concerns, but with basic
Btrfs multi-device handling design choices.

If you have concerns about the reliability of specific storage
and system configurations you should become or find a system
integration and qualification engineer who understands the many
subtleties of storage devices and device-system interconnects
and who would run extensive tests on it; storage and system
commissioning is often far from trivial even in seemingly simple
cases, due in part to the enormous complexity of interfaces, even
when they have few bugs, and tests made with one combination often
do not have the same results even on apparently similar
combinations.

I suspect that you should have asked a completely different set
of questions (XY problem), but the above are I think good answers
to the questions that you have actually asked.


Is it safe to use btrfs on top of different types of devices?

2017-10-14 Thread Zoltán Ivánfi
Dear Btrfs Experts,

A few years ago I tried to use a RAID1 mdadm array of a SATA and a USB
disk, which led to strange error messages and data corruption. I did
some searching back then and found out that using hot-pluggable
devices with mdadm is a paved road to data corruption. Reading through
that old bug again I see that it was autoclosed due to old age but
still hasn't been addressed:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/320638

I would like to ask whether btrfs may also be prone to data corruption
issues in this scenario (due to the same underlying issue as the one
described in the bug above for mdadm), or is btrfs unaffected by the
underlying issue and is safe to use with a mix of regular and
hot-pluggable devices as well?

Thanks,

Zoltan