Re: Mis-Design of Btrfs?

2011-07-15 Thread NeilBrown
On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) da...@lang.hm wrote:

 On Thu, 14 Jul 2011, Chris Mason wrote:
 
  Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
  On 07/14/2011 07:38 AM, NeilBrown wrote:
  On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com wrote:
 
  I'm certainly open to suggestions and collaboration.  Do you have in 
  mind any
  particular way to make the interface richer??
 
  NeilBrown
  Hi Neil,
 
  I know that Chris has a very specific set of use cases for btrfs and 
  think that
  Alasdair and others have started to look at what is doable.
 
  The obvious use case is the following:
 
  If a file system uses checksumming or other data corruption detection 
  bits, it
  can detect that it got bad data on a write. If that data was protected 
  by RAID,
  it would like to ask the block layer to try to read from another mirror 
  (for
  raid1) or try to validate/rebuild from parity.
 
  Today, I think that a retry will basically just give us back a random 
  chance of
  getting data from a different mirror or the same one that we got data 
  from on
  the first go.
 
  Chris, Alasdair, was that a good summary of one concern?
 
  Thanks!
 
  Ric
  I imagine a new field in 'struct bio' which was normally zero but could be
  some small integer.  It is only meaningful for read.
  When 0 it means get this data any way you like.
  When non-zero it means get this data using method N, where the different
  methods are up to the device.
 
  For a mirrored RAID, method N means read from device N-1.
  For stripe/parity RAID, method 1 means use other data blocks and parity
  blocks to reconstruct data.
 
  The default for non RAID devices is to return EINVAL for any N > 0.
  A remapping device (dm-linear, dm-stripe etc) would just pass the number
  down.  I'm not sure how RAID1 over RAID5 would handle it... that might 
  need
  some thought.
 
  So if btrfs reads a block and the checksum looks wrong, it reads again 
  with
  a larger N.  It continues incrementing N and retrying until it gets a 
  block
  that it likes or it gets EINVAL.  There should probably be an error code
  (EAGAIN?) which means I cannot work with that number, but try the next 
  one.
 
  It would be trivial for me to implement this for RAID1 and RAID10, and
  relatively easy for RAID5.
  I'd need to give a bit of thought to RAID6 as there are possibly multiple
  ways to reconstruct from different combinations of parity and data.  I'm 
  not
  sure if there would be much point in doing that though.
 
  It might make sense for a device to be able to report what the maximum
  'N' supported is... that might make stacked raid easier to manage...
 
  NeilBrown
 
 
  I think that the above makes sense. Not sure what your 0 definition is, 
  but I
  would assume that for non-raided devices (i.e., a single s-ata disk), 0 
  would
  be an indication that there is nothing more that can be tried. The data 
  you got
  is all you are ever going to get :)
 
  For multiple mirrors, you might want to have a scheme where you would be 
  able to
  cycle through the mirrors. You could retry, cycling through the various 
  mirrors
  until you have tried and returned them all, at which point you would get a
  "no more" error back or some such thing.
 
  Hi everyone,
 
  The mirror number idea is basically what btrfs does today, and I think
  it fits in with Neil's idea to have different policies for different
  blocks.
 
  Basically what btrfs does:
 
  read_block(block_num, mirror = 0)
  if (no io error and not csum error)
          horray()

  num_mirrors = get_num_copies(block_num)
  for (i = 1; i < num_mirrors; i++) {
          read_block(block_num, mirror = i);
  }
 
  In a stacked configuration, the get_num_copies function can be smarter,
  basically adding up all the copies of the lower levels and finding a way
  to combine them.  We could just send down a fake bio that is responsible
  for adding up the storage combinations into a bitmask or whatever works.
 
  We could also just keep retrying until the lower layers reported no more
  mirrors were available.
 
  In btrfs at least, we don't set the data blocks up to date until the crc
  has passed, so replacing bogus blocks is easy.  Metadata is a little
  more complex, but that's not really related to this topic.
 
  mirror number 0 just means no mirror preference/pick the fastest
  mirror to the btrfs block mapping code.
 
  Btrfs has the concept of different raid levels for different logical
  block numbers, so your get_num_copies might return one answer for block A
  and a different answer for block B.
 
  Either way, we could easily make use of a bio field here if it were
  exported out.
 
 you don't want to just pass the value down as that will cause problems 
 with layering (especially if the lower layer supports more values than a 
 higher layer)
 
 I would suggest that each layer take the value it's given, do an integer
 division by the number of values that layer supports, using the modulo
 value for that layer, and pass the rest of the result down to the next layer.

Re: Mis-Design of Btrfs?

2011-07-15 Thread Chris Mason
Excerpts from NeilBrown's message of 2011-07-15 02:33:54 -0400:
 On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) da...@lang.hm wrote:
 
  On Thu, 14 Jul 2011, Chris Mason wrote:
  
   Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
   On 07/14/2011 07:38 AM, NeilBrown wrote:
   On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com wrote:
  
   I'm certainly open to suggestions and collaboration.  Do you have in 
   mind any
   particular way to make the interface richer??
  
   NeilBrown
   Hi Neil,
  
   I know that Chris has a very specific set of use cases for btrfs and 
   think that
   Alasdair and others have started to look at what is doable.
  
   The obvious use case is the following:
  
   If a file system uses checksumming or other data corruption detection 
   bits, it
   can detect that it got bad data on a write. If that data was protected 
   by RAID,
   it would like to ask the block layer to try to read from another 
   mirror (for
   raid1) or try to validate/rebuild from parity.
  
   Today, I think that a retry will basically just give us back a random 
   chance of
   getting data from a different mirror or the same one that we got data 
   from on
   the first go.
  
   Chris, Alasdair, was that a good summary of one concern?
  
   Thanks!
  
   Ric
   I imagine a new field in 'struct bio' which was normally zero but could 
   be
   some small integer.  It is only meaningful for read.
   When 0 it means get this data any way you like.
   When non-zero it means get this data using method N, where the 
   different
   methods are up to the device.
  
   For a mirrored RAID, method N means read from device N-1.
   For stripe/parity RAID, method 1 means use other data blocks and parity
   blocks to reconstruct data.
  
   The default for non RAID devices is to return EINVAL for any N > 0.
   A remapping device (dm-linear, dm-stripe etc) would just pass the number
   down.  I'm not sure how RAID1 over RAID5 would handle it... that might 
   need
   some thought.
  
   So if btrfs reads a block and the checksum looks wrong, it reads again 
   with
   a larger N.  It continues incrementing N and retrying until it gets a 
   block
   that it likes or it gets EINVAL.  There should probably be an error code
   (EAGAIN?) which means I cannot work with that number, but try the next 
   one.
  
   It would be trivial for me to implement this for RAID1 and RAID10, and
   relatively easy for RAID5.
   I'd need to give a bit of thought to RAID6 as there are possibly 
   multiple
   ways to reconstruct from different combinations of parity and data.  
   I'm not
   sure if there would be much point in doing that though.
  
   It might make sense for a device to be able to report what the maximum
   'N' supported is... that might make stacked raid easier to manage...
  
   NeilBrown
  
  
   I think that the above makes sense. Not sure what your 0 definition 
   is, but I
   would assume that for non-raided devices (i.e., a single s-ata disk), 
   0 would
   be an indication that there is nothing more that can be tried. The data 
   you got
   is all you are ever going to get :)
  
   For multiple mirrors, you might want to have a scheme where you would be 
   able to
   cycle through the mirrors. You could retry, cycling through the various 
   mirrors
   until you have tried and returned them all, at which point you would get
   a "no more" error back or some such thing.
  
   Hi everyone,
  
   The mirror number idea is basically what btrfs does today, and I think
   it fits in with Neil's idea to have different policies for different
   blocks.
  
   Basically what btrfs does:
  
   read_block(block_num, mirror = 0)
   if (no io error and not csum error)
           horray()

   num_mirrors = get_num_copies(block_num)
   for (i = 1; i < num_mirrors; i++) {
           read_block(block_num, mirror = i);
   }
  
   In a stacked configuration, the get_num_copies function can be smarter,
   basically adding up all the copies of the lower levels and finding a way
   to combine them.  We could just send down a fake bio that is responsible
   for adding up the storage combinations into a bitmask or whatever works.
  
   We could also just keep retrying until the lower layers reported no more
   mirrors were available.
  
   In btrfs at least, we don't set the data blocks up to date until the crc
   has passed, so replacing bogus blocks is easy.  Metadata is a little
   more complex, but that's not really related to this topic.
  
   mirror number 0 just means no mirror preference/pick the fastest
   mirror to the btrfs block mapping code.
  
   Btrfs has the concept of different raid levels for different logical
   block numbers, so your get_num_copies might return one answer for block A
   and a different answer for block B.
  
   Either way, we could easily make use of a bio field here if it were
   exported out.
  
  you don't want to just pass the value down as that will cause problems
  with layering (especially if the lower layer supports more values than a
  higher layer)

Re: Mis-Design of Btrfs?

2011-07-15 Thread Ric Wheeler

On 07/15/2011 12:34 PM, Chris Mason wrote:

Excerpts from NeilBrown's message of 2011-07-15 02:33:54 -0400:

On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) da...@lang.hm wrote:


On Thu, 14 Jul 2011, Chris Mason wrote:


Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:

On 07/14/2011 07:38 AM, NeilBrown wrote:

On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com wrote:


I'm certainly open to suggestions and collaboration.  Do you have in mind any
particular way to make the interface richer??

NeilBrown

Hi Neil,

I know that Chris has a very specific set of use cases for btrfs and think that
Alasdair and others have started to look at what is doable.

The obvious use case is the following:

If a file system uses checksumming or other data corruption detection bits, it
can detect that it got bad data on a write. If that data was protected by RAID,
it would like to ask the block layer to try to read from another mirror (for
raid1) or try to validate/rebuild from parity.

Today, I think that a retry will basically just give us back a random chance of
getting data from a different mirror or the same one that we got data from on
the first go.

Chris, Alasdair, was that a good summary of one concern?

Thanks!

Ric

I imagine a new field in 'struct bio' which was normally zero but could be
some small integer.  It is only meaningful for read.
When 0 it means get this data any way you like.
When non-zero it means get this data using method N, where the different
methods are up to the device.

For a mirrored RAID, method N means read from device N-1.
For stripe/parity RAID, method 1 means use other data blocks and parity
blocks to reconstruct data.

The default for non RAID devices is to return EINVAL for any N > 0.
A remapping device (dm-linear, dm-stripe etc) would just pass the number
down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
some thought.

So if btrfs reads a block and the checksum looks wrong, it reads again with
a larger N.  It continues incrementing N and retrying until it gets a block
that it likes or it gets EINVAL.  There should probably be an error code
(EAGAIN?) which means I cannot work with that number, but try the next one.

It would be trivial for me to implement this for RAID1 and RAID10, and
relatively easy for RAID5.
I'd need to give a bit of thought to RAID6 as there are possibly multiple
ways to reconstruct from different combinations of parity and data.  I'm not
sure if there would be much point in doing that though.

It might make sense for a device to be able to report what the maximum
'N' supported is... that might make stacked raid easier to manage...

NeilBrown


I think that the above makes sense. Not sure what your 0 definition is, but I
would assume that for non-raided devices (i.e., a single s-ata disk), 0 would
be an indication that there is nothing more that can be tried. The data you got
is all you are ever going to get :)

For multiple mirrors, you might want to have a scheme where you would be able to
cycle through the mirrors. You could retry, cycling through the various mirrors
until you have tried and returned them all, at which point you would get a
"no more" error back or some such thing.

Hi everyone,

The mirror number idea is basically what btrfs does today, and I think
it fits in with Neil's idea to have different policies for different
blocks.

Basically what btrfs does:

read_block(block_num, mirror = 0)
if (no io error and not csum error)
        horray()

num_mirrors = get_num_copies(block_num)
for (i = 1; i < num_mirrors; i++) {
        read_block(block_num, mirror = i);
}

In a stacked configuration, the get_num_copies function can be smarter,
basically adding up all the copies of the lower levels and finding a way
to combine them.  We could just send down a fake bio that is responsible
for adding up the storage combinations into a bitmask or whatever works.
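
For illustration, a minimal sketch of how a stacked get_num_copies() could add
up the copies offered by the layers underneath; the struct and field names are
hypothetical, not an existing btrfs or block-layer interface, and parity
reconstruction and btrfs's per-block raid levels are ignored here:

/* Illustrative sketch only: each redundancy layer reports how many
 * distinct ways a block can be fetched, counting the ways offered by
 * whatever sits underneath each of its legs. */
struct stacked_dev {
	int nlegs;                   /* redundant legs at this layer (0 = plain disk) */
	struct stacked_dev **legs;   /* what each leg is built on */
};

static int get_num_copies(const struct stacked_dev *dev)
{
	int i, total = 0;

	if (!dev || dev->nlegs == 0)
		return 1;            /* a plain device: one way to read the data */
	for (i = 0; i < dev->nlegs; i++)
		total += get_num_copies(dev->legs[i]);
	return total;
}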

We could also just keep retrying until the lower layers reported no more
mirrors were available.

In btrfs at least, we don't set the data blocks up to date until the crc
has passed, so replacing bogus blocks is easy.  Metadata is a little
more complex, but that's not really related to this topic.

mirror number 0 just means no mirror preference/pick the fastest
mirror to the btrfs block mapping code.

Btrfs has the concept of different raid levels for different logical
block numbers, so your get_num_copies might return one answer for block A
and a different answer for block B.

Either way, we could easily make use of a bio field here if it were
exported out.

you don't want to just pass the value down as that will cause problems
with layering (especially if the lower layer supports more values than a
higher layer)

I would suggest that each layer take the value it's given, do an integer
division by the number of values that layer supports, using the modulo
value for that layer and pass the rest of the result down to the next
layer.

I, on the other hand, would suggest that each layer be given the freedom, and
the responsibility, to do whatever it thinks is most appropriate.

Re: Mis-Design of Btrfs?

2011-07-15 Thread Chris Mason
Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
 On 07/15/2011 12:34 PM, Chris Mason wrote:

[ triggering IO retries on failed crc or other checks ]

 
  But, maybe the whole btrfs model is backwards for a generic layer.
  Instead of sending down ios and testing when they come back, we could
  just set a verification function (or stack of them?).
 
  For metadata, btrfs compares the crc and a few other fields of the
  metadata block, so we can easily add a compare function pointer and a
  void * to pass in.
 
  The problem is the crc can take a lot of CPU, so btrfs kicks it off to
  threading pools to saturate all the cpus on the box.  But there's no
  reason we can't make that available lower down.
 
  If we pushed the verification down, the retries could bubble up the
  stack instead of the other way around.
 
  -chris
 
 I do like the idea of having the ability to do the verification and retries 
 down 
 the stack where you actually have the most context to figure out what is 
 possible...
 
 Why would you need to bubble back up anything other than an error when all 
 retries have failed?

By bubble up I mean that if you have multiple layers capable of doing
retries, the lowest levels would retry first.  Basically by the time we
get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
do to help.
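
For illustration, a rough sketch of that model: the filesystem attaches a
verification callback to the read, a mirrored layer exhausts its own
alternatives before failing, and only then does the error bubble up.  Every
name here (read_req, verify_fn, the callbacks) is a hypothetical stand-in, not
an existing kernel interface:

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* A read carries a verification callback supplied by the filesystem,
 * e.g. a crc check plus whatever header fields it cares about. */
typedef bool (*verify_fn)(void *state, const char *data, size_t len);

struct read_req {
	unsigned long long block;
	char *buf;
	size_t len;
	verify_fn verify;
	void *verify_state;
};

/* A mirrored layer tries every copy it owns, verifying each one, before
 * giving up -- so retries happen at the lowest capable layer first and
 * the caller only sees an error once this layer is out of options. */
static int mirror_read(struct read_req *req, int ncopies,
		       int (*read_copy)(int copy, struct read_req *req))
{
	int copy;

	for (copy = 0; copy < ncopies; copy++) {
		if (read_copy(copy, req) != 0)
			continue;                 /* I/O error: try the next copy */
		if (req->verify(req->verify_state, req->buf, req->len))
			return 0;                 /* data verified, done */
	}
	return -EIO;  /* nothing below could help; let the layer above decide */
}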

-chris


Re: Mis-Design of Btrfs?

2011-07-15 Thread Ric Wheeler

On 07/15/2011 02:20 PM, Chris Mason wrote:

Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:

On 07/15/2011 12:34 PM, Chris Mason wrote:

[ triggering IO retries on failed crc or other checks ]


But, maybe the whole btrfs model is backwards for a generic layer.
Instead of sending down ios and testing when they come back, we could
just set a verification function (or stack of them?).

For metadata, btrfs compares the crc and a few other fields of the
metadata block, so we can easily add a compare function pointer and a
void * to pass in.

The problem is the crc can take a lot of CPU, so btrfs kicks it off to
threading pools to saturate all the cpus on the box.  But there's no
reason we can't make that available lower down.

If we pushed the verification down, the retries could bubble up the
stack instead of the other way around.

-chris

I do like the idea of having the ability to do the verification and retries down
the stack where you actually have the most context to figure out what is 
possible...

Why would you need to bubble back up anything other than an error when all
retries have failed?

By bubble up I mean that if you have multiple layers capable of doing
retries, the lowest levels would retry first.  Basically by the time we
get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
do to help.

-chris


Absolutely sounds like the most sane way to go to me, thanks!

Ric



Re: Mis-Design of Btrfs?

2011-07-15 Thread Hugo Mills
On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:
 Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
  On 07/15/2011 02:20 PM, Chris Mason wrote:
   Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
   On 07/15/2011 12:34 PM, Chris Mason wrote:
   [ triggering IO retries on failed crc or other checks ]
  
   But, maybe the whole btrfs model is backwards for a generic layer.
   Instead of sending down ios and testing when they come back, we could
   just set a verification function (or stack of them?).
  
   For metadata, btrfs compares the crc and a few other fields of the
   metadata block, so we can easily add a compare function pointer and a
   void * to pass in.
  
   The problem is the crc can take a lot of CPU, so btrfs kicks it off to
   threading pools to saturate all the cpus on the box.  But there's no
   reason we can't make that available lower down.
  
   If we pushed the verification down, the retries could bubble up the
   stack instead of the other way around.
  
   -chris
   I do like the idea of having the ability to do the verification and 
   retries down
   the stack where you actually have the most context to figure out what is 
   possible...
  
   Why would you need to bubble back up anything other than an error when 
   all
   retries have failed?
   By bubble up I mean that if you have multiple layers capable of doing
   retries, the lowest levels would retry first.  Basically by the time we
   get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
   do to help.
  
   -chris
  
  Absolutely sounds like the most sane way to go to me, thanks!
  
 
 It really seemed like a good idea, but I just realized it doesn't work
 well when parts of the stack transform the data.
 
 Picture dm-crypt on top of raid1.  If raid1 is responsible for the
 crc retries, there's no way to crc the data because it needs to be
 decrypted first.
 
 I think the raided dm-crypt config is much more common (and interesting)
 than multiple layers that can retry for other reasons (raid1 on top of
 raid10?)

   Isn't this a case where the transformative mid-layer would replace
the validation function before passing it down the stack? So btrfs
hands dm-crypt a checksum function; dm-crypt then stores that function
for its own purposes and hands off a new function to the DM layer
below that which decrypts the data and calls the btrfs checksum
function it stored earlier.

 In other words, do we really want to do a lot of design work for
 multiple layers where each one maintains multiple copies of the data
 blocks?  Are there configs where this really makes sense?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- What are we going to do tonight? The same thing we do --- 
every night, Pinky.  Try to take over the world!




Re: Mis-Design of Btrfs?

2011-07-15 Thread Chris Mason
Excerpts from Hugo Mills's message of 2011-07-15 10:07:24 -0400:
 On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:
  Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
   On 07/15/2011 02:20 PM, Chris Mason wrote:
Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
On 07/15/2011 12:34 PM, Chris Mason wrote:
[ triggering IO retries on failed crc or other checks ]
   
But, maybe the whole btrfs model is backwards for a generic layer.
Instead of sending down ios and testing when they come back, we could
just set a verification function (or stack of them?).
   
For metadata, btrfs compares the crc and a few other fields of the
metadata block, so we can easily add a compare function pointer and a
void * to pass in.
   
The problem is the crc can take a lot of CPU, so btrfs kicks it off to
threading pools to saturate all the cpus on the box.  But there's no
reason we can't make that available lower down.
   
If we pushed the verification down, the retries could bubble up the
stack instead of the other way around.
   
-chris
I do like the idea of having the ability to do the verification and 
retries down
the stack where you actually have the most context to figure out what 
is possible...
   
Why would you need to bubble back up anything other than an error when 
all
retries have failed?
By bubble up I mean that if you have multiple layers capable of doing
retries, the lowest levels would retry first.  Basically by the time we
get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
do to help.
   
-chris
   
   Absolutely sounds like the most sane way to go to me, thanks!
   
  
  It really seemed like a good idea, but I just realized it doesn't work
  well when parts of the stack transform the data.
  
  Picture dm-crypt on top of raid1.  If raid1 is responsible for the
  crc retries, there's no way to crc the data because it needs to be
  decrypted first.
  
  I think the raided dm-crypt config is much more common (and interesting)
  than multiple layers that can retry for other reasons (raid1 on top of
  raid10?)
 
Isn't this a case where the transformative mid-layer would replace
 the validation function before passing it down the stack? So btrfs
 hands dm-crypt a checksum function; dm-crypt then stores that function
 for its own purposes and hands off a new function to the DM layer
 below that which decrypts the data and calls the btrfs checksum
 function it stored earlier.

Then we're requiring each transformation layer to have their own crcs,
and if the higher layers have a stronger crc (or other checks), there's
no path to ask the lower layers for other copies.

Here's a concrete example.  In each metadata block, btrfs stores the
fsid and the transid of the transaction that created it.  In the case of
a missed write, we'll read a perfect block from the lower layers.  Any
crcs will be correct and it'll pass through dm-crypt with flying colors.

But, it won't be the right block.  Btrfs will notice this and EIO.  In
the current ask-for-another-mirror config we'll go down and grab the
other copy.

In the stacked validation function model, dm-crypt replaces our
verification functions with something that operates on the encrypted
data, and it won't be able to detect the error or kick down to the
underlying raid1 for another copy.
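
To make the missed-write case concrete, a simplified sketch of the kind of
check involved; the field names and the crc32c() helper are stand-ins for
illustration, not the real btrfs on-disk format:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for a metadata block header and for what the
 * reader expects to find in it. */
struct meta_header {
	uint32_t crc;         /* checksum over the block payload */
	uint8_t  fsid[16];    /* which filesystem wrote this block */
	uint64_t transid;     /* transaction that should have written it */
};

uint32_t crc32c(const void *data, size_t len);   /* assumed helper, not defined here */

/* A stale block from a missed write can carry a perfectly valid crc and
 * still be the wrong block; only the fsid/transid comparison catches it,
 * and a layer that sees only encrypted bytes cannot make that check. */
static bool meta_block_ok(const struct meta_header *h,
			  const void *payload, size_t payload_len,
			  const uint8_t *want_fsid, uint64_t want_transid)
{
	if (crc32c(payload, payload_len) != h->crc)
		return false;
	if (memcmp(h->fsid, want_fsid, sizeof(h->fsid)) != 0)
		return false;
	return h->transid == want_transid;
}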

-chris


Re: Mis-Design of Btrfs?

2011-07-15 Thread Hugo Mills
On Fri, Jul 15, 2011 at 10:24:25AM -0400, Chris Mason wrote:
 Excerpts from Hugo Mills's message of 2011-07-15 10:07:24 -0400:
  On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:
   Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
On 07/15/2011 02:20 PM, Chris Mason wrote:
 Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
 On 07/15/2011 12:34 PM, Chris Mason wrote:
 [ triggering IO retries on failed crc or other checks ]

 But, maybe the whole btrfs model is backwards for a generic layer.
 Instead of sending down ios and testing when they come back, we 
 could
 just set a verification function (or stack of them?).

 For metadata, btrfs compares the crc and a few other fields of the
 metadata block, so we can easily add a compare function pointer and 
 a
 void * to pass in.

 The problem is the crc can take a lot of CPU, so btrfs kicks it off 
 to
 threading pools to saturate all the cpus on the box.  But there's no
 reason we can't make that available lower down.

 If we pushed the verification down, the retries could bubble up the
 stack instead of the other way around.

 -chris
 I do like the idea of having the ability to do the verification and 
 retries down
 the stack where you actually have the most context to figure out 
 what is possible...

 Why would you need to bubble back up anything other than an error 
 when all
 retries have failed?
 By bubble up I mean that if you have multiple layers capable of doing
 retries, the lowest levels would retry first.  Basically by the time 
 we
 get an -EIO_ALREADY_RETRIED we know there's nothing that lower level 
 can
 do to help.

 -chris

Absolutely sounds like the most sane way to go to me, thanks!

   
   It really seemed like a good idea, but I just realized it doesn't work
   well when parts of the stack transform the data.
   
   Picture dm-crypt on top of raid1.  If raid1 is responsible for the
   crc retries, there's no way to crc the data because it needs to be
   decrypted first.
   
   I think the raided dm-crypt config is much more common (and interesting)
   than multiple layers that can retry for other reasons (raid1 on top of
   raid10?)
  
 Isn't this a case where the transformative mid-layer would replace
  the validation function before passing it down the stack? So btrfs
  hands dm-crypt a checksum function; dm-crypt then stores that function
  for its own purposes and hands off a new function to the DM layer
  below that which decrypts the data and calls the btrfs checksum
  function it stored earlier.
 
 Then we're requiring each transformation layer to have their own crcs,
 and if the higher layers have a stronger crc (or other checks), there's
 no path to ask the lower layers for other copies.
 
 Here's a concrete example.  In each metadata block, btrfs stores the
 fsid and the transid of the transaction that created it.  In the case of
 a missed write, we'll read a perfect block from the lower layers.  Any
 crcs will be correct and it'll pass through dm-crypt with flying colors.
 
 But, it won't be the right block.  Btrfs will notice this and EIO.  In
 the current ask-for-another-mirror config we'll go down and grab the
 other copy.
 
 In the stacked validation function model, dm-crypt replaces our
 verification functions with something that operates on the encrypted
 data, and it won't be able to detect the error or kick down to the
 underlying raid1 for another copy.

   What I'm suggesting is easiest to describe with a functional
language, or mathematical notation, but I'll try without those
anyway...

   A generic validation function is a two-parameter function, taking a
block of data and some layer-dependent state, and returning a boolean.

So, at the btrfs layer, we have something like:

struct btrfs_validate_state {
	u32 cs;
};

bool btrfs_validate(void *state, char *data) {
	return crc32(data) == ((struct btrfs_validate_state *)state)->cs;
}

   When reading a specific block, we look up the checksum for that
block, put it in state->cs and pass both our state and the
btrfs_validate function to the lower layer.

   The crypto layer beneath will look something like:

struct crypto_validate_state {
	void *old_state;
	bool (*old_validator)(void *state, char *data);
};

bool crypto_validate(void *state, char *data) {
	struct crypto_validate_state *s = state;
	char plaintext[4096];

	decrypt(plaintext, data);

	return s->old_validator(s->old_state, plaintext);
}

   Then, when a request is received from the upper layer, we can put
the (state, validator) pair into a crypto_validate_state structure,
call that our new state, and pass on the new (state, crypto_validate)
to the layer underneath.
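
To make that wrapping step concrete, a small sketch continuing the structures
above; submit_lower() is a hypothetical stand-in for whatever read interface
the next layer down exposes:

bool submit_lower(char *data, void *state,
		  bool (*validator)(void *state, char *data));   /* hypothetical */

/* When the crypto layer receives a read, it captures the caller's
 * (state, validator) pair and hands (wrapped state, crypto_validate)
 * to the layer below.  That layer neither knows nor cares that a
 * decryption happens before the original btrfs check runs. */
bool crypto_read(char *data, void *btrfs_state,
		 bool (*btrfs_validator)(void *state, char *data))
{
	struct crypto_validate_state wrapped = {
		.old_state = btrfs_state,
		.old_validator = btrfs_validator,
	};

	return submit_lower(data, &wrapped, crypto_validate);
}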

   So, where a transformative layer exists in the stack, it replaces
the validation function with a helper function that reverses the
transformation and then calls the validator it was handed from above.

Re: Mis-Design of Btrfs?

2011-07-15 Thread Christian Aßfalg
Am Freitag, den 15.07.2011, 10:24 -0400 schrieb Chris Mason:
 Excerpts from Hugo Mills's message of 2011-07-15 10:07:24 -0400:
  On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:
   Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
On 07/15/2011 02:20 PM, Chris Mason wrote:
 Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
 On 07/15/2011 12:34 PM, Chris Mason wrote:
 [ triggering IO retries on failed crc or other checks ]

 But, maybe the whole btrfs model is backwards for a generic layer.
 Instead of sending down ios and testing when they come back, we 
 could
 just set a verification function (or stack of them?).

 For metadata, btrfs compares the crc and a few other fields of the
 metadata block, so we can easily add a compare function pointer and 
 a
 void * to pass in.

 The problem is the crc can take a lot of CPU, so btrfs kicks it off 
 to
 threading pools to saturate all the cpus on the box.  But there's no
 reason we can't make that available lower down.

 If we pushed the verification down, the retries could bubble up the
 stack instead of the other way around.

 -chris
 I do like the idea of having the ability to do the verification and 
 retries down
 the stack where you actually have the most context to figure out 
 what is possible...

 Why would you need to bubble back up anything other than an error 
 when all
 retries have failed?
 By bubble up I mean that if you have multiple layers capable of doing
 retries, the lowest levels would retry first.  Basically by the time 
 we
 get an -EIO_ALREADY_RETRIED we know there's nothing that lower level 
 can
 do to help.

 -chris

Absolutely sounds like the most sane way to go to me, thanks!

   
   It really seemed like a good idea, but I just realized it doesn't work
   well when parts of the stack transform the data.
   
   Picture dm-crypt on top of raid1.  If raid1 is responsible for the
   crc retries, there's no way to crc the data because it needs to be
   decrypted first.
   
   I think the raided dm-crypt config is much more common (and interesting)
   than multiple layers that can retry for other reasons (raid1 on top of
   raid10?)
  
 Isn't this a case where the transformative mid-layer would replace
  the validation function before passing it down the stack? So btrfs
  hands dm-crypt a checksum function; dm-crypt then stores that function
  for its own purposes and hands off a new function to the DM layer
  below that which decrypts the data and calls the btrfs checksum
  function it stored earlier.
 
 Then we're requiring each transformation layer to have their own crcs,
 and if the higher layers have a stronger crc (or other checks), there's
 no path to ask the lower layers for other copies.
 
 Here's a concrete example.  In each metadata block, btrfs stores the
 fsid and the transid of the transaction that created it.  In the case of
 a missed write, we'll read a perfect block from the lower layers.  Any
 crcs will be correct and it'll pass through dm-crypt with flying colors.
 
 But, it won't be the right block.  Btrfs will notice this and EIO.  In
 the current ask-for-another-mirror config we'll go down and grab the
 other copy.
 
 In the stacked validation function model, dm-crypt replaces our
 verification functions with something that operates on the encrypted
 data, and it won't be able to detect the error or kick down to the
 underlying raid1 for another copy.
 
 -chris


I think the point is not to replace the crc function in the dm_crypt
case, but to wrap it with a decrypt function which then calls the crc
function. So even if a lower mirror uses the new dm-crypt crc function,
the btrfs crc function still gets called - at the end of the chain.

Regards,
Christian Aßfalg



Re: Mis-Design of Btrfs?

2011-07-15 Thread david

On Fri, 15 Jul 2011, NeilBrown wrote:


On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) da...@lang.hm wrote:


On Thu, 14 Jul 2011, Chris Mason wrote:


Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:

On 07/14/2011 07:38 AM, NeilBrown wrote:

On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com wrote:



I would suggest that each layer take the value it's given, do an integer
division by the number of values that layer supports, using the modulo
value for that layer and pass the rest of the result down to the next
layer.


I, on the other hand, would suggest that each layer be given the freedom, and
the responsibility, to do whatever it thinks is most appropriate.

This would probably involve checking the lower levels to see how many
strategies each has, and doing some appropriate arithmetic depending on how
it combines those devices.


nothing is wrong with doing something smarter than what I was proposing, I 
was just intending to propose a default rule that would work (just about) 
everywhere. I was specifically trying to counter the thought that the 
method number would/should be constant as it's passed down the layers.


The proposal had been made to have each layer do the retries for the layer 
below it, that would avoid this stacking problem, but at the cost of 
potentially doing a LOT of retries without the upper level being able to 
limit it.
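
For illustration, a minimal sketch of the divide-and-remainder routing
described above, assuming each layer knows how many alternatives it has
locally; nothing here corresponds to an existing md or dm interface:

/* Illustrative only.  The caller's "method" number is split at each
 * layer: the remainder selects this layer's alternative, the quotient
 * (plus one) is handed to the layer below, so every combination of
 * choices down the stack gets its own distinct number.  0 keeps its
 * usual meaning of "any way you like". */
static int route_method(int method, int nr_local, int *local_choice)
{
	if (method == 0) {
		*local_choice = 0;                /* no preference at this layer */
		return 0;                         /* and none passed further down */
	}
	method -= 1;                              /* make the rest zero-based */
	*local_choice = method % nr_local;        /* which alternative to use here */
	return (method / nr_local) + 1;           /* method number for the next layer */
}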


David Lang


One problem here is the assumption that the lower levels don't change, and we
know that not to be the case.
However this is already a problem.  It is entirely possible that the request
size parameters of devices can change at any moment, even while a request is
in-flight ... though we try to avoid that or work around it.

The sort of dependencies that we see forming here would probably just make
the problem worse.

Not sure what to do about it, though; maybe just hope it isn't a problem.

NeilBrown




this works for just trying values until you get the error that tells you
there are no more options.


things can get very 'interesting' if the different options below you have
different number of options (say a raid1 across a raid5 and a single disk)
but I can't think of any good way to figure this out and assemble a
sane order without doing an exhaustive search down to find the number of
options for each layer (and since this can be different for different
blocks, you would have to do this each time you invoked this option)

David Lang






Re: Mis-Design of Btrfs?

2011-07-15 Thread david

On Fri, 15 Jul 2011, Chris Mason wrote:


Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:

On 07/15/2011 12:34 PM, Chris Mason wrote:


By bubble up I mean that if you have multiple layers capable of doing
retries, the lowest levels would retry first.  Basically by the time we
get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
do to help.


the problem with doing this is that it can end up stalling the box for 
significant amounts of time while all the retries happen.


we already see this happening today where a disk read failure is retried 
multiple times by the disk, multiple times by the raid controller, and 
then multiple times by Linux, resulting in multi-minute stalls when you 
hit a disk error in some cases.


having the lower layers do the retries automatically runs the risk of 
making this even worse.


This needs to be able to be throttled by some layer that can see the 
entire picture (either by cutting off the retries after a number, after 
some time, or by spacing out the retries to allow other queries to get in 
and let the box do useful work in the meantime)


David Lang



Re: Mis-Design of Btrfs?

2011-07-15 Thread Ric Wheeler

On 07/15/2011 05:23 PM, da...@lang.hm wrote:

On Fri, 15 Jul 2011, Chris Mason wrote:


Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:

On 07/15/2011 12:34 PM, Chris Mason wrote:


By bubble up I mean that if you have multiple layers capable of doing
retries, the lowest levels would retry first.  Basically by the time we
get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
do to help.


the problem with doing this is that it can end up stalling the box for 
significant amounts of time while all the retries happen.


we already see this happening today where a disk read failure is retried 
multiple times by the disk, multiple times by the raid controller, and then 
multiple times by Linux, resulting in multi-minute stalls when you hit a disk 
error in some cases.


having the lower layers do the retries automatically runs the risk of making 
this even worse.


This needs to be able to be throttled by some layer that can see the entire 
picture (either by cutting off the retries after a number, after some time, or 
by spacing out the retries to allow other queries to get in and let the box do 
useful work in the meantime)


David Lang



That should not be an issue - we have a fast fail path for IO that should 
avoid retrying just for those reasons (i.e., for multi-path or when recovering a 
flaky drive).


This is not a scheme for unbounded retries. If you have a 3 disk mirror in 
RAID1, you would read the data no more than 2 extra times and almost never more 
than once.  That should be *much* faster than the multiple-second long timeout 
you see when waiting for SCSI timeout to fire, etc.


Ric



Re: Mis-Design of Btrfs?

2011-07-15 Thread david

On Fri, 15 Jul 2011, Ric Wheeler wrote:


On 07/15/2011 05:23 PM, da...@lang.hm wrote:

On Fri, 15 Jul 2011, Chris Mason wrote:


Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:

On 07/15/2011 12:34 PM, Chris Mason wrote:


By bubble up I mean that if you have multiple layers capable of doing
retries, the lowest levels would retry first.  Basically by the time we
get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
do to help.


the problem with doing this is that it can end up stalling the box for 
significant amounts of time while all the retries happen.


we already see this happening today where a disk read failure is retried 
multiple times by the disk, multiple times by the raid controller, and then 
multiple times by Linux, resulting in multi-minute stalls when you hit a 
disk error in some cases.


having the lower layers do the retries automatically runs the risk of 
making this even worse.


This needs to be able to be throttled by some layer that can see the entire 
picture (either by cutting off the retries after a number, after some time, 
or by spacing out the retries to allow other queries to get in and let the 
box do useful work in the meantime)


David Lang



That should not be an issue - we have a fast fail path for IO that should 
avoid retrying just for those reasons (i.e., for multi-path or when 
recovering a flaky drive).


This is not a scheme for unbounded retries. If you have a 3 disk mirror in 
RAID1, you would read the data no more than 2 extra times and almost never 
more than once.  That should be *much* faster than the multiple-second long 
timeout you see when waiting for SCSI timeout to fire, etc.


this issue is when you stack things.

what if you have a 3 piece raid 1 on top of a 4 piece raid 6?

so you have 3 raid1 retries * N raid 6 retries. depending on the order 
that you do the retries in, and how long it takes that try to fail, this 
could start to take significant amounts of time.


if you do a retry on the lower level first, the raid 6 could try several 
different ways to combine the drives to get the valid data (disks 1,2; 2,3; 
3,4; 1,3; 1,4; 2,4; 1,2,3; 1,2,4; 1,3,4; 2,3,4); add more disks and it gets worse 
fast.


add more layers and you multiply the number of possible retries.

my guess is that changing to a different method at the upper level is 
going to avoid the problem area faster than doing so at a lower level 
(because there is less hardware in common with the method that just gave 
the wrong answer)


David Lang


Re: Mis-Design of Btrfs?

2011-07-15 Thread Ric Wheeler

On 07/15/2011 06:01 PM, da...@lang.hm wrote:

On Fri, 15 Jul 2011, Ric Wheeler wrote:


On 07/15/2011 05:23 PM, da...@lang.hm wrote:

On Fri, 15 Jul 2011, Chris Mason wrote:


Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:

On 07/15/2011 12:34 PM, Chris Mason wrote:


By bubble up I mean that if you have multiple layers capable of doing
retries, the lowest levels would retry first.  Basically by the time we
get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
do to help.


the problem with doing this is that it can end up stalling the box for 
significant amounts of time while all the retries happen.


we already see this happening today where a disk read failure is retried 
multiple times by the disk, multiple times by the raid controller, and then 
multiple times by Linux, resulting in multi-minute stalls when you hit a 
disk error in some cases.


having the lower layers do the retries automatically runs the risk of making 
this even worse.


This needs to be able to be throttled by some layer that can see the entire 
picture (either by cutting off the retries after a number, after some time, 
or by spacing out the retries to allow other queries to get in and let the 
box do useful work in the meantime)


David Lang



That should not be an issue - we have a fast fail path for IO that should 
avoid retrying just for those reasons (i.e., for multi-path or when 
recovering a flaky drive).


This is not a scheme for unbounded retries. If you have a 3 disk mirror in 
RAID1, you would read the data no more than 2 extra times and almost never 
more than once.  That should be *much* faster than the multiple-second long 
timeout you see when waiting for SCSI timeout to fire, etc.


this issue is when you stack things.

what if you have a 3 piece raid 1 on top of a 4 piece raid 6?

so you have 3 raid1 retries * N raid 6 retries. depending on the order that 
you do the retries in, and how long it takes that try to fail, this could 
start to take significant amounts of time.


if you do a retry on the lower level first, the raid 6 could try several 
different ways to combine the drives to get the valid data (disks 1,2; 2,3; 3,4; 
1,3; 1,4; 2,4; 1,2,3; 1,2,4; 1,3,4; 2,3,4); add more disks and it gets worse fast.


add more layers and you multiply the number of possible retries.

my guess is that changing to a different method at the upper level is going to 
avoid the problem area faster than doing so at a lower level (because there is 
less hardware in common with the method that just gave the wrong answer)


David Lang


At some point, the question is why would you do that?  Two parity drives for 
each 4 drive RAID-6 set and then mirror that 3 times (12 drives in total, only 2 
data drives)? Better off doing a 4 way mirror :)


I think that you are still missing the point.

If the lowest layer can repair the data, it would return the first validated 
answer. The major time sink would be the IO to read the sectors from each of the 
4 drives in your example, not computing the various combinations of parity or 
validating (all done in memory).


If the RAID-6 layer failed, you would do the same for each of the mirrors which 
would read the IO from each of their RAID-6 low levels exactly once as well (and 
then verify in memory).


Ric



Re: Mis-Design of Btrfs?

2011-07-14 Thread Ric Wheeler

On 07/14/2011 06:56 AM, NeilBrown wrote:

On Wed, 29 Jun 2011 10:29:53 +0100 Ric Wheeler rwhee...@redhat.com wrote:


On 06/27/2011 07:46 AM, NeilBrown wrote:

On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
nico-lkml-20110...@schottelius.org   wrote:


Good morning devs,

I'm wondering whether the raid and volume management built into btrfs is
actually a sane idea or not.
Currently we do have md/device-mapper support for raid
already, btrfs lacks raid5 support and re-implements stuff that
has already been done.

I'm aware of the fact that it is very useful to know on which devices
we are in a filesystem. But I'm wondering, whether it wouldn't be
smarter to generalise the information exposure through the VFS layer
instead of replicating functionality:

Physical:  USB-HD   SSD   USB-Flash     | Exposes information to
Raid:      Raid1, Raid5, Raid10, etc.   | higher levels
Crypto:    Luks                         |
LVM:       Groups/Volumes               |
FS:        xfs/jfs/reiser/ext3          v

Thus a filesystem like ext3 could be aware that it is running
on a USB HD, enable -o sync by default, have the filesystem
rewrite blocks when running on crypto, or optimise for an SSD, ...

I would certainly agree that exposing information to higher levels is a good
idea.  To some extent we do.  But it isn't always as easy as it might sound.
Choosing exactly what information to expose is the challenge.  If you lack
sufficient foresight you might expose something which turns out to be
very specific to just one device, so all those upper levels which make use of
the information find they are really special-casing one specific device,
which isn't a good idea.


However it doesn't follow that RAID5 should not be implemented in BTRFS.
The levels that you have drawn are just one perspective.  While that has
value, it may not be universal.
I could easily argue that the LVM layer is a mistake and that filesystems
should provide that functionality directly.
I could almost argue the same for crypto.
RAID1 can make a lot of sense to be tightly integrated with the FS.
RAID5 ... I'm less convinced, but then I have a vested interest there so that
isn't an objective assessment.

Part of the way Linux works is that s/he who writes the code gets to make
the design decisions.   The BTRFS developers might create something truly
awesome, or might end up having to support a RAID feature that they
subsequently think is a bad idea.  But it really is their decision to make.

NeilBrown


One more thing to add here is that I think that we still have a chance to
increase the sharing between btrfs and the MD stack if we can get those changes
made. No one likes to duplicate code, but we will need a richer interface
between the block and file system layer to help close that gap.

Ric


I'm certainly open to suggestions and collaboration.  Do you have in mind any
particular way to make the interface richer??

NeilBrown


Hi Neil,

I know that Chris has a very specific set of use cases for btrfs and think that 
Alasdair and others have started to look at what is doable.


The obvious use case is the following:

If a file system uses checksumming or other data corruption detection bits, it 
can detect that it got bad data on a write. If that data was protected by RAID, 
it would like to ask the block layer to try to read from another mirror (for 
raid1) or try to validate/rebuild from parity.


Today, I think that a retry will basically just give us back a random chance of 
getting data from a different mirror or the same one that we got data from on 
the first go.


Chris, Alasdair, was that a good summary of one concern?

Thanks!

Ric



Re: Mis-Design of Btrfs?

2011-07-14 Thread NeilBrown
On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com wrote:

  I'm certainly open to suggestions and collaboration.  Do you have in mind 
  any
  particular way to make the interface richer??
 
  NeilBrown
 
 Hi Neil,
 
 I know that Chris has a very specific set of use cases for btrfs and think 
 that 
 Alasdair and others have started to look at what is doable.
 
 The obvious use case is the following:
 
 If a file system uses checksumming or other data corruption detection bits, 
 it 
 can detect that it got bad data on a write. If that data was protected by 
 RAID, 
 it would like to ask the block layer to try to read from another mirror (for 
 raid1) or try to validate/rebuild from parity.
 
 Today, I think that a retry will basically just give us back a random chance 
 of 
 getting data from a different mirror or the same one that we got data from on 
 the first go.
 
 Chris, Alasdair, was that a good summary of one concern?
 
 Thanks!
 
 Ric

I imagine a new field in 'struct bio' which was normally zero but could be
some small integer.  It is only meaningful for read.
When 0 it means get this data any way you like.
When non-zero it means get this data using method N, where the different
methods are up to the device.

For a mirrored RAID, method N means read from device N-1.
For stripe/parity RAID, method 1 means use other data blocks and parity
blocks to reconstruct data.

The default for non RAID devices is to return EINVAL for any N > 0.
A remapping device (dm-linear, dm-stripe etc) would just pass the number
down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
some thought.

So if btrfs reads a block and the checksum looks wrong, it reads again with
a larger N.  It continues incrementing N and retrying until it gets a block
that it likes or it gets EINVAL.  There should probably be an error code
(EAGAIN?) which means I cannot work with that number, but try the next one.

It would be trivial for me to implement this for RAID1 and RAID10, and
relatively easy for RAID5.
I'd need to give a bit of thought to RAID6 as there are possibly multiple
ways to reconstruct from different combinations of parity and data.  I'm not
sure if there would be much point in doing that though.

It might make sense for a device to be able to report what the maximum
'N' supported is... that might make stacked raid easier to manage...

NeilBrown
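
Purely as an illustration of the interface sketched above, a userspace-style
mock-up: the read carries a small integer method, an n-way mirror maps method
N to leg N-1, and anything out of range gets EINVAL.  There is no such field
in struct bio today, and every name below is hypothetical:

#include <errno.h>

struct read_attempt {
	unsigned long long sector;
	void *buf;
	unsigned int method;   /* 0 = any way you like, N = use strategy N */
};

/* An n-way mirror: method N means "read leg N-1"; method 0 lets the
 * device pick for itself (here simply leg 0); anything beyond the
 * number of legs means there are no more strategies to offer. */
static int raid1_read(struct read_attempt *r, unsigned int nlegs,
		      int (*read_leg)(unsigned int leg, struct read_attempt *r))
{
	if (r->method == 0)
		return read_leg(0, r);
	if (r->method > nlegs)
		return -EINVAL;
	return read_leg(r->method - 1, r);
}

A filesystem would then retry with method = 1, 2, ... until the data verifies
or -EINVAL comes back, treating the proposed "cannot do that one, try the next"
error as a skip.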



Re: Mis-Design of Btrfs?

2011-07-14 Thread Ric Wheeler

On 07/14/2011 07:38 AM, NeilBrown wrote:

On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com wrote:


I'm certainly open to suggestions and collaboration.  Do you have in mind any
particular way to make the interface richer??

NeilBrown

Hi Neil,

I know that Chris has a very specific set of use cases for btrfs and think that
Alasdair and others have started to look at what is doable.

The obvious use case is the following:

If a file system uses checksumming or other data corruption detection bits, it
can detect that it got bad data on a write. If that data was protected by RAID,
it would like to ask the block layer to try to read from another mirror (for
raid1) or try to validate/rebuild from parity.

Today, I think that a retry will basically just give us back a random chance of
getting data from a different mirror or the same one that we got data from on
the first go.

Chris, Alasdair, was that a good summary of one concern?

Thanks!

Ric

I imagine a new field in 'struct bio' which was normally zero but could be
some small integer.  It is only meaningful for read.
When 0 it means get this data any way you like.
When non-zero it means get this data using method N, where the different
methods are up to the device.

For a mirrored RAID, method N means read from device N-1.
For stripe/parity RAID, method 1 means use other data blocks and parity
blocks to reconstruct data.

The default for non RAID devices is to return EINVAL for any N > 0.
A remapping device (dm-linear, dm-stripe etc) would just pass the number
down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
some thought.

So if btrfs reads a block and the checksum looks wrong, it reads again with
a larger N.  It continues incrementing N and retrying until it gets a block
that it likes or it gets EINVAL.  There should probably be an error code
(EAGAIN?) which means I cannot work with that number, but try the next one.

It would be trivial for me to implement this for RAID1 and RAID10, and
relatively easy for RAID5.
I'd need to give a bit of thought to RAID6 as there are possibly multiple
ways to reconstruct from different combinations of parity and data.  I'm not
sure if there would be much point in doing that though.

It might make sense for a device to be able to report what the maximum
'N' supported is... that might make stacked raid easier to manage...

NeilBrown



I think that the above makes sense. Not sure what your 0 definition is, but I 
would assume that for non-raided devices (i.e., a single s-ata disk), 0 would 
be an indication that there is nothing more that can be tried. The data you got 
is all you are ever going to get :)


For multiple mirrors, you might want to have a scheme where you would be able to 
cycle through the mirrors. You could retry, cycling through the various mirrors 
until you have tried and returned them all, at which point you would get a 
"no more" error back or some such thing.


Thanks!

ric



Re: Mis-Design of Btrfs?

2011-07-14 Thread Arne Jansen

On 14.07.2011 08:02, Ric Wheeler wrote:

On 07/14/2011 06:56 AM, NeilBrown wrote:

I'm certainly open to suggestions and collaboration. Do you have in
mind any
particular way to make the interface richer??


If a file system uses checksumming or other data corruption detection
bits, it can detect that it got bad data on a write. If that data was
protected by RAID, it would like to ask the block layer to try to read
from another mirror (for raid1) or try to validate/rebuild from parity.

Today, I think that a retry will basically just give us back a random
chance of getting data from a different mirror or the same one that we
got data from on the first go.


Another case that comes to mind is the 'remove device' operation.
To accomplish this, btrfs just rewrites all the data that reside
on that device to other drives.
Also, scrub and my recently posted readahead patches make heavy
use of the knowledge of how the raid is laid out. Readahead always
uses as many drives as possible in parallel, while trying to
avoid unnecessary seeks on each device.

-Arne



Chris, Alasdair, was that a good summary of one concern?

Thanks!

Ric




Re: Mis-Design of Btrfs?

2011-07-14 Thread Jan Schmidt
Hi Neil,

On 14.07.2011 08:38, NeilBrown wrote:
 I imagine a new field in 'struct bio' which was normally zero but could be
 some small integer.  It is only meaningful for read.
 When 0 it means get this data way you like.
 When non-zero it means get this data using method N, where the different
 methods are up to the device.
 
 For a mirrored RAID, method N means read from device N-1.
 For stripe/parity RAID, method 1 means use other data blocks and parity
 blocks to reconstruct data.
 
 The default for non RAID devices is to return EINVAL for any N > 0.
 A remapping device (dm-linear, dm-stripe etc) would just pass the number
 down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
 some thought.
 
 So if btrfs reads a block and the checksum looks wrong, it reads again with
 a larger N.  It continues incrementing N and retrying until it gets a block
 that it likes or it gets EINVAL.  There should probably be an error code
 (EAGAIN?) which means I cannot work with that number, but try the next one.

I like this idea. It comes pretty close to what btrfs is currently doing
(which was the reason this thread was started, wasn't it?), only not
within struct bio.

The way you describe the extra parameter is input only. I think it would
be a nice add-on if we knew which N was used when 0 is passed for the
request. At least for the RAID1 case, I'd like to see that when I submit
a bio with method (or whatever we call it) 0, it comes back with the
method field set to the value that reflects the method the device
actually used. Obviously, when passing non-zero values, the bio would
have to come back with that value unmodified.

That way, we'll get more control over the retry algorithms and are free to
decide not to use the failed method again, if we like.

Setting method on the return path might be valuable not only for
RAID1, but I haven't thought that through.
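As a rough sketch of that, again with invented helper names, where the
device reports back which method it actually used for a method-0 read:

#include <errno.h>

extern int submit_read_report(unsigned long long block, void *buf,
                              int method, int *used);
extern int checksum_ok(const void *buf);

int read_avoiding_failed_copy(unsigned long long block, void *buf)
{
        int used = 0;

        if (submit_read_report(block, buf, 0, &used) == 0 && checksum_ok(buf))
                return 0;

        int failed = used;      /* the method the "0" read actually used */

        for (int m = 1; ; m++) {
                if (m == failed)
                        continue;       /* don't repeat the copy that failed */
                int err = submit_read_report(block, buf, m, &used);
                if (err == 0 && checksum_ok(buf))
                        return 0;
                if (err == -EINVAL)
                        return -EIO;    /* nothing left to try */
        }
}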

-Jan


Re: Mis-Design of Btrfs?

2011-07-14 Thread NeilBrown
On Thu, 14 Jul 2011 11:37:41 +0200 Jan Schmidt list.bt...@jan-o-sch.net
wrote:

 Hi Neil,
 
 On 14.07.2011 08:38, NeilBrown wrote:
  I imagine a new field in 'struct bio' which was normally zero but could be
  some small integer.  It is only meaningful for read.
  When 0 it means get this data way you like.
  When non-zero it means get this data using method N, where the different
  methods are up to the device.
  
  For a mirrored RAID, method N means read from device N-1.
  For stripe/parity RAID, method 1 means use other data blocks and parity
  blocks to reconstruct data.
  
  The default for non RAID devices is to return EINVAL for any N > 0.
  A remapping device (dm-linear, dm-stripe etc) would just pass the number
  down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
  some thought.
  
  So if btrfs reads a block and the checksum looks wrong, it reads again with
  a larger N.  It continues incrementing N and retrying until it gets a block
  that it likes or it gets EINVAL.  There should probably be an error code
  (EAGAIN?) which means I cannot work with that number, but try the next 
  one.
 
 I like this idea. It comes pretty close to what btrfs is currently doing
 (what was the reason for this thread being started, wasn't it?), only
 not within struct bio.
 
 The way you describe the extra parameter is input only. I think it would
 be a nice add on if we knew which N was used when 0 passed for the
 request. At least for the RAID1 case, I'd like to see that when I submit
 a bio with method (or whatever we call it) 0, it comes back with the
 method field set to the value that reflects the method the device
 actually used. Obviously, when passing non-zero values, the bio would
 have to come back with that value unmodified.
 
 That way, we'll get more control on the retry algorithms and are free to
 decide not to use the failed method again, if we like.
 
 Setting method on the return path might be valuable not only for
 RAID1, but I haven't thought that trough.
 
 -Jan

That sounds like it would be reasonable...

It would be important not to read too much into the number though.  i.e.
don't think of it as RAID1 but keep a much more abstract idea, else you
could run into trouble.

A (near) future addition to md is keeping track of 'bad blocks' so we can
fail more gracefully as devices start to fail.
So a read request to a RAID1 might not be served by just one device, but
might be served by one device for some parts, and another device for other
parts, so as to avoid one or more 'bad blocks'.

I think I could still provide a reasonably consistent mapping from 'arbitrary
number' to 'set of devices', but it may not be what you expect.  And the
number '1' may well correspond to different devices for different bi_sector
offsets.

i.e. the abstraction must allow the low level implementation substantial
freedom in choosing how to implement each request.

But yes - I don't see why we couldn't report which strategy was used with the
implication that using that same strategy at the same offset with the same
size would be expected to produce the same result.
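A toy illustration of that kind of offset-dependent mapping, with
entirely made-up structures rather than anything from md:

#include <stdbool.h>

#define NDEVS 2         /* a two-disk mirror, for the sake of the example */

extern bool sector_is_bad(int dev, unsigned long long sector);

/* Toy model: method N picks the Nth device that is still usable at this
 * sector, so the same N can map to different disks at different offsets.
 * Returns a device index, or -1 where the caller would see -EINVAL. */
int device_for_method(int method, unsigned long long sector)
{
        int seen = 0;

        for (int dev = 0; dev < NDEVS; dev++) {
                if (sector_is_bad(dev, sector))
                        continue;
                if (++seen == method)
                        return dev;
        }
        return -1;
}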

NeilBrown


Re: Mis-Design of Btrfs?

2011-07-14 Thread Goffredo Baroncelli
On 07/14/2011 08:38 AM, NeilBrown wrote:
 On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com wrote:
 
 I'm certainly open to suggestions and collaboration.  Do you have in mind 
 any
 particular way to make the interface richer??

 NeilBrown

 Hi Neil,

 I know that Chris has a very specific set of use cases for btrfs and think 
 that 
 Alasdair and others have started to look at what is doable.

 The obvious use case is the following:

 If a file system uses checksumming or other data corruption detection bits, 
 it 
 can detect that it got bad data on a write. If that data was protected by 
 RAID, 
 it would like to ask the block layer to try to read from another mirror (for 
 raid1) or try to validate/rebuild from parity.

 Today, I think that a retry will basically just give us back a random chance 
 of 
 getting data from a different mirror or the same one that we got data from 
 on 
 the first go.

 Chris, Alasdair, was that a good summary of one concern?

 Thanks!

 Ric
 
 I imagine a new field in 'struct bio' which was normally zero but could be
 some small integer.  It is only meaningful for read.
 When 0 it means get this data way you like.
 When non-zero it means get this data using method N, where the different
 methods are up to the device.

In more general terms, the filesystem should be able to request: try
another read that is different from the previous ones. The wording is
important because we should differentiate the case "wrong data from
disk1, so read from disk0" from "wrong data from disk0, so read from
disk1". I prefer to think of the field as a bitmap, where every bit
represents a different way of reading, so it can also be reused to
track which kinds of read were already tried (roughly as sketched
after the list below).

After a 2nd read, the block layer should:
a) redo the read if possible, otherwise FAIL
b) pass the data to the filesystem
c) if the filesystem accepts the new data, replace the wrong
   data with the correct one or mark the block as broken.
d) inform the userspace/filesystem of the result
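
A small sketch of that bitmap bookkeeping, with an invented interface in
which the filesystem passes the mask of already-tried methods down and
the device picks one that is not yet in the mask:

#include <errno.h>

/* Hypothetical interface: 'tried' is a bitmap of read methods already
 * seen to return bad data.  The device picks any method not yet in the
 * mask and sets the bit for whichever one it used. */
extern int submit_read_mask(unsigned long long block, void *buf,
                            unsigned long *tried);
extern int checksum_ok(const void *buf);

int read_until_untried_copy_ok(unsigned long long block, void *buf)
{
        unsigned long tried = 0;
        int err;

        do {
                err = submit_read_mask(block, buf, &tried);
                if (err == 0 && checksum_ok(buf))
                        return 0;       /* a not-yet-tried copy checked out */
        } while (err != -EINVAL);       /* -EINVAL: every method already tried */

        return -EIO;
}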

 
 For a mirrored RAID, method N means read from device N-1.
 For stripe/parity RAID, method 1 means use other data blocks and parity
 blocks to reconstruct data.
 
 The default for non RAID devices is to return EINVAL for any N > 0.
 A remapping device (dm-linear, dm-stripe etc) would just pass the number
 down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
 some thought.
 
 So if btrfs reads a block and the checksum looks wrong, it reads again with
 a larger N.  It continues incrementing N and retrying until it gets a block
 that it likes or it gets EINVAL.  There should probably be an error code
 (EAGAIN?) which means I cannot work with that number, but try the next one.
 
 It would be trivial for me to implement this for RAID1 and RAID10, and
 relatively easy for RAID5.
 I'd need to give a bit of thought to RAID6 as there are possibly multiple
 ways to reconstruct from different combinations of parity and data.  I'm not
 sure if there would be much point in doing that though.
 
 It might make sense for a device to be able to report what the maximum
 'N' supported is... that might make stacked raid easier to manage...
 
 NeilBrown
 



Re: Mis-Design of Btrfs?

2011-07-14 Thread Alasdair G Kergon
On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:
 It might make sense for a device to be able to report what the maximum
 'N' supported is... that might make stacked raid easier to manage...
 
I'll just say that any solution ought to be stackable.
This means understanding both that the number of data access routes may
vary as you move through the stack, and that this number may depend on
the offset within the device.

Alasdair




Re: Mis-Design of Btrfs?

2011-07-14 Thread John Stoffel
 Alasdair == Alasdair G Kergon a...@redhat.com writes:

Alasdair On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:
 It might make sense for a device to be able to report what the maximum
 'N' supported is... that might make stacked raid easier to manage...
 
Alasdair I'll just say that any solution ought to be stackable.

I've been mulling this over too and wondering how you'd handle this,
because upper layers really can't peek down into lower layers easily.
As far as I understand things.

So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks

How does btrfs handle errors (or does it even see them!) from the
raid6 level when a single nbd device goes away?  Or taking the
original example, when btrfs notices a checksum isn't correct, how
would it push down multiple levels to try and find the correct data? 

Alasdair This means understanding both that the number of data access
Alasdair routes may vary as you move through the stack, and that this
Alasdair number may depend on the offset within the device.

It almost seems to me that the retry needs to be done at each level on
its own, without pushing down or up the stack.  But this doesn't
handle the wrong file checksum issue.  

Hmm... maybe instead of just one number, we need another to count the
levels down you go (or just split a 16-bit integer in half, the bottom
half being the count of tries, the upper half being how many levels down
to try that read?)
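
For what it's worth, that split is cheap to express; a sketch with
arbitrarily chosen field widths:

/* Hypothetical encoding: low byte = retry count at the current level,
 * high byte = how many levels further down to push the retry.  The
 * widths are arbitrary; this only shows how cheap the split is. */
unsigned int encode_retry(unsigned int tries, unsigned int levels_down)
{
        return ((levels_down & 0xff) << 8) | (tries & 0xff);
}

unsigned int retry_tries(unsigned int v)       { return v & 0xff; }
unsigned int retry_levels_down(unsigned int v) { return v >> 8; }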

It seems to defeat the purpose of layers if you can go down and find
out how many layers there are underneath you

John



Re: Mis-Design of Btrfs?

2011-07-14 Thread Erik Jensen
 On Wed, Jun 29, 2011 at 3:47 AM, A. James Lewis ja...@fsck.co.uk wrote:
 Is there a possibility that one could have a 3 disk RAID5 array, and
 then add a 4th disk and then do a balance, growing the RAID5 onto 4
 disks and gaining the space still with RAID5?  It seems that to be
 consistent, BTRFS would have to do this.

 If this is the case, then I think that the BTRFS implementation of RAID5
 would have to be quite different to the MD implementation.

 James.

 My understanding, gleaned from IRC several months ago, is that Btrfs
 would use the new drive, but not change the stripe size. Each
 allocation would then be written across some selection of three of the
 four drives.

 In other words, if you started with four stripes across three drives:
   AAA
   BBB
   CCC
   DDD
 and then added a drive and balanced, you might get something like:
   AAAB
   BBCC
   CDDD
 which would give you more space, but still use ⅓ of the space for parity.

 Trying to remove a drive from the original three-drive configuration
 would be an error, similarly to trying to remove the second to last
 drive in RAID 1, currently.

 Actually changing the stripe size would be done using the same
 mechanism as changing RAID levels.

 Again, this is just an interested but uninvolved person's
 understanding based on an IRC conversation, so please salt to taste.

 -- Erik


Re: Mis-Design of Btrfs?

2011-07-14 Thread david

On Thu, 14 Jul 2011, John Stoffel wrote:


Alasdair I'll just say that any solution ought to be stackable.

I've been mulling this over too and wondering how you'd handle this,
because upper layers really can't peak down into lower layers easily.
As far as I understand things.

So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks

How does btrfs handle errors (or does it even see them!) from the
raid6 level when a single nbd device goes away?  Or taking the
original example, when btrfs notices a checksum isn't correct, how
would it push down multiple levels to try and find the correct data?

Alasdair This means understanding both that the number of data access
Alasdair routes may vary as you move through the stack, and that this
Alasdair number may depend on the offset within the device.

It almost seems to me that the retry needs to be done at each level on
it's own, without pushing down or up the stack.  But this doesn't
handle the wrong file checksum issue.

Hmm... maybe instead of just one number, we need another to count the
levels down you go (or just split 16bit integer in half, bottom half
being count of tries, the upper half being levels down to try that
read?)

It seems to defeat the purpose of layers if you can go down and find
out how many layers there are underneath you


this is why just an arbitrary 'method number', rather than a bitmap or 
something like that, should be used.


using your example:


So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks


raid1 has at least 2 values, raid6 has at least 2 values, so the combination 
of the two stacked should have at least 4 values.


if each layer can query the layer below it to find out how many methods it 
supports, it can then combine each of the methods it supports with each of 
the methods supported by the layer below it.


this will stack to an arbitrary number of layers, limited only by how 
large the value is allowed to be, which bounds the combined permutations 
of all the layers' options.
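
A tiny sketch of that counting, assuming a made-up model where each
layer knows its own alternatives and can ask the layer below for its
total (so raid1 with 2 methods over raid6 with 2 methods reports 4, as
in the example above):

/* Hypothetical model of a stack: each layer knows how many read methods
 * it has of its own and which layer sits underneath it. */
struct layer {
        int own_methods;        /* e.g. 2 for a two-way raid1 */
        struct layer *below;    /* NULL at the bottom of the stack */
};

/* Methods visible at this layer: its own alternatives combined with
 * every alternative the layers underneath can offer. */
int max_methods(const struct layer *l)
{
        if (l == NULL)
                return 1;
        return l->own_methods * max_methods(l->below);
}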


David Lang


Re: Mis-Design of Btrfs?

2011-07-14 Thread Erik Jensen
 On Thu, Jul 14, 2011 at 12:50 PM, John Stoffel j...@stoffel.org wrote:
 Alasdair == Alasdair G Kergon a...@redhat.com writes:

 Alasdair On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:
 It might make sense for a device to be able to report what the maximum
 'N' supported is... that might make stacked raid easier to manage...

 Alasdair I'll just say that any solution ought to be stackable.

 I've been mulling this over too and wondering how you'd handle this,
 because upper layers really can't peak down into lower layers easily.
 As far as I understand things.

 So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks

 How does btrfs handle errors (or does it even see them!) from the
 raid6 level when a single nbd device goes away?  Or taking the
 original example, when btrfs notices a checksum isn't correct, how
 would it push down multiple levels to try and find the correct data?

 Alasdair This means understanding both that the number of data access
 Alasdair routes may vary as you move through the stack, and that this
 Alasdair number may depend on the offset within the device.

 It almost seems to me that the retry needs to be done at each level on
 it's own, without pushing down or up the stack.  But this doesn't
 handle the wrong file checksum issue.

 Hmm... maybe instead of just one number, we need another to count the
 levels down you go (or just split 16bit integer in half, bottom half
 being count of tries, the upper half being levels down to try that
 read?)

 It seems to defeat the purpose of layers if you can go down and find
 out how many layers there are underneath you

 John

A random thought: What if we allow the number to wrap at each level,
and, each time it wraps, increment the number passed to the next lower
level.

A zero would propagate down, letting each level do what it wants:
luks: 0
raid1: 0
raid6: 0
nbd: 0

And higher numbers would indicate the method at each level:

For a 1:
luks: 1
raid1: 1
raid6: 1
nbd: 1

For a 3:
luks: 1 (only one possibility, passes three down)
raid1: 1 (two possibilities, so we wrap back to one and pass two down,
since we wrapped once)
raid6: 2 (not wrapped)
nbd: 1

When the bottom-most level gets an N that it can't handle, it would
return EINVAL, which would be propagated up the stack.

This would allow the same algorithm of incrementing N until we receive
good data or EINVAL, and would exhaust all ways of reading the data at
all levels.


Re: Mis-Design of Btrfs?

2011-07-14 Thread Chris Mason
Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
 On 07/14/2011 07:38 AM, NeilBrown wrote:
  On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com  wrote:
 
  I'm certainly open to suggestions and collaboration.  Do you have in mind 
  any
  particular way to make the interface richer??
 
  NeilBrown
  Hi Neil,
 
  I know that Chris has a very specific set of use cases for btrfs and think 
  that
  Alasdair and others have started to look at what is doable.
 
  The obvious use case is the following:
 
  If a file system uses checksumming or other data corruption detection 
  bits, it
  can detect that it got bad data on a write. If that data was protected by 
  RAID,
  it would like to ask the block layer to try to read from another mirror 
  (for
  raid1) or try to validate/rebuild from parity.
 
  Today, I think that a retry will basically just give us back a random 
  chance of
  getting data from a different mirror or the same one that we got data from 
  on
  the first go.
 
  Chris, Alasdair, was that a good summary of one concern?
 
  Thanks!
 
  Ric
  I imagine a new field in 'struct bio' which was normally zero but could be
  some small integer.  It is only meaningful for read.
  When 0 it means get this data way you like.
  When non-zero it means get this data using method N, where the different
  methods are up to the device.
 
  For a mirrored RAID, method N means read from device N-1.
  For stripe/parity RAID, method 1 means use other data blocks and parity
  blocks to reconstruct data.
 
  The default for non RAID devices is to return EINVAL for any N > 0.
  A remapping device (dm-linear, dm-stripe etc) would just pass the number
  down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
  some thought.
 
  So if btrfs reads a block and the checksum looks wrong, it reads again with
  a larger N.  It continues incrementing N and retrying until it gets a block
  that it likes or it gets EINVAL.  There should probably be an error code
  (EAGAIN?) which means I cannot work with that number, but try the next 
  one.
 
  It would be trivial for me to implement this for RAID1 and RAID10, and
  relatively easy for RAID5.
  I'd need to give a bit of thought to RAID6 as there are possibly multiple
  ways to reconstruct from different combinations of parity and data.  I'm not
  sure if there would be much point in doing that though.
 
  It might make sense for a device to be able to report what the maximum
  'N' supported is... that might make stacked raid easier to manage...
 
  NeilBrown
 
 
 I think that the above makes sense. Not sure what your 0 definition is, but 
 I 
 would assume that for non-raided devices (i.e., a single s-ata disk), 0 
 would 
 be an indication that there is nothing more that can be tried. The data you 
 got 
 is all you are ever going to get :)
 
 For multiple mirrors, you might want to have a scheme where you would be able 
 to 
 cycle through the mirrors. You could retry, cycling through the various 
 mirrors 
 until you have tried and returned them all at which point you would get a no 
 more error back or some such thing.

Hi everyone,

The mirror number idea is basically what btrfs does today, and I think
it fits in with Neil's idea to have different policies for different
blocks.

Basically what btrfs does:

read_block(block_num, mirror = 0)
if (no io error and not csum error)
        hooray()

/* first read failed its checksum: walk the remaining copies */
num_mirrors = get_num_copies(block_num)
for (i = 1; i < num_mirrors; i++) {
        read_block(block_num, mirror = i);
}

In a stacked configuration, the get_num_copies function can be smarter,
basically adding up all the copies of the lower levels and finding a way
to combine them.  We could just send down a fake bio that is responsible
for adding up the storage combinations into a bitmask or whatever works.

We could also just keep retrying until the lower layers reported that no
more mirrors were available.

In btrfs at least, we don't set the data blocks up to date until the crc
has passed, so replacing bogus blocks is easy.  Metadata is a little
more complex, but that's not really related to this topic.

To the btrfs block mapping code, mirror number 0 just means no mirror
preference/pick the fastest mirror.

Btrfs has the concept of different raid levels for different logical
block numbers, so your get_num_copies might return one answer for block A
and a different answer for block B.

Either way, we could easily make use of a bio field here if it were
exported out.

-chris


Re: Mis-Design of Btrfs?

2011-07-14 Thread david

On Thu, 14 Jul 2011, Chris Mason wrote:


Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:

On 07/14/2011 07:38 AM, NeilBrown wrote:

On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com  wrote:


I'm certainly open to suggestions and collaboration.  Do you have in mind any
particular way to make the interface richer??

NeilBrown

Hi Neil,

I know that Chris has a very specific set of use cases for btrfs and think that
Alasdair and others have started to look at what is doable.

The obvious use case is the following:

If a file system uses checksumming or other data corruption detection bits, it
can detect that it got bad data on a write. If that data was protected by RAID,
it would like to ask the block layer to try to read from another mirror (for
raid1) or try to validate/rebuild from parity.

Today, I think that a retry will basically just give us back a random chance of
getting data from a different mirror or the same one that we got data from on
the first go.

Chris, Alasdair, was that a good summary of one concern?

Thanks!

Ric

I imagine a new field in 'struct bio' which was normally zero but could be
some small integer.  It is only meaningful for read.
When 0 it means get this data way you like.
When non-zero it means get this data using method N, where the different
methods are up to the device.

For a mirrored RAID, method N means read from device N-1.
For stripe/parity RAID, method 1 means use other data blocks and parity
blocks to reconstruct data.

The default for non RAID devices is to return EINVAL for any N > 0.
A remapping device (dm-linear, dm-stripe etc) would just pass the number
down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
some thought.

So if btrfs reads a block and the checksum looks wrong, it reads again with
a larger N.  It continues incrementing N and retrying until it gets a block
that it likes or it gets EINVAL.  There should probably be an error code
(EAGAIN?) which means I cannot work with that number, but try the next one.

It would be trivial for me to implement this for RAID1 and RAID10, and
relatively easy for RAID5.
I'd need to give a bit of thought to RAID6 as there are possibly multiple
ways to reconstruct from different combinations of parity and data.  I'm not
sure if there would be much point in doing that though.

It might make sense for a device to be able to report what the maximum
'N' supported is... that might make stacked raid easier to manage...

NeilBrown



I think that the above makes sense. Not sure what your 0 definition is, but I
would assume that for non-raided devices (i.e., a single s-ata disk), 0 would
be an indication that there is nothing more that can be tried. The data you got
is all you are ever going to get :)

For multiple mirrors, you might want to have a scheme where you would be able to
cycle through the mirrors. You could retry, cycling through the various mirrors
until you have tried and returned them all at which point you would get a no
more error back or some such thing.


Hi everyone,

The mirror number idea is basically what btrfs does today, and I think
it fits in with Neil's idea to have different policies for different
blocks.

Basically what btrfs does:

read_block(block_num, mirror = 0)
if (no io error and not csum error)
horray()

num_mirrors = get_num_copies(block number)
for (i = 1; i < num_mirrors; i++) {
read_block(block_num, mirror = i);
}

In a stacked configuration, the get_num_copies function can be smarter,
basically adding up all the copies of the lower levels and finding a way
to combine them.  We could just send down a fake bio that is responsible
for adding up the storage combinations into a bitmask or whatever works.

We could also just keep retrying until the lower layers reported no more
mirror were available.

In btrfs at least, we don't set the data blocks up to date until the crc
has passed, so replacing bogus blocks is easy.  Metadata is a little
more complex, but that's not really related to this topic.

mirror number 0 just means no mirror preference/pick the fastest
mirror to the btrfs block mapping code.

Btrfs has the concept of different raid levels for different logical
block numbers, so you get_num_copies might return one answer for block A
and a different answer for block B.

Either way, we could easily make use of a bio field here if it were
exported out.


you don't want to just pass the value down as that will cause problems 
with layering (especially if the lower layer supports more values than a 
higher layer)


I would suggest that each layer take the value it's given, do an integer 
division by the number of values that layer supports, use the modulo 
value for that layer, and pass the rest of the result down to the next 
layer.


this works for just trying values until you get the error that tells you 
there are no more options.
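
Roughly, with a made-up helper (own_methods being however many values
the current layer supports itself):

/* Hypothetical split: this layer keeps (value % own_methods) for itself
 * and hands (value / own_methods) to the layer below, so one flat number
 * selects a combination of choices all the way down the stack. */
int split_method(int value, int own_methods, int *pass_below)
{
        *pass_below = value / own_methods;
        return value % own_methods;     /* this layer's own choice */
}

e.g. with two local methods, a value of 5 keeps 1 for this layer and
passes 2 down to the next one.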



things can get very 'interesting' if the different options below you have 

Re: Mis-Design of Btrfs?

2011-07-13 Thread NeilBrown
On Wed, 29 Jun 2011 10:29:53 +0100 Ric Wheeler rwhee...@redhat.com wrote:

 On 06/27/2011 07:46 AM, NeilBrown wrote:
  On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
  nico-lkml-20110...@schottelius.org  wrote:
 
  Good morning devs,
 
  I'm wondering whether the raid- and volume-management-builtin of btrfs is
  actually a sane idea or not.
  Currently we do have md/device-mapper support for raid
  already, btrfs lacks raid5 support and re-implements stuff that
  has already been done.
 
  I'm aware of the fact that it is very useful to know on which devices
  we are in a filesystem. But I'm wondering, whether it wouldn't be
  smarter to generalise the information exposure through the VFS layer
  instead of replicating functionality:
 
  Physical:   USB-HD   SSD   USB-Flash  | Exposes information to
  Raid:   Raid1, Raid5, Raid10, etc.| higher levels
  Crypto: Luks  |
  LVM:Groups/Volumes|
  FS: xfs/jfs/reiser/ext3   v
 
  Thus a filesystem like ext3 could be aware that it is running
  on a USB HD, enable -o sync be default or have the filesystem
  to rewrite blocks when running on crypto or optimise for an SSD, ...
  I would certainly agree that exposing information to higher levels is a good
  idea.  To some extent we do.  But it isn't always as easy as it might sound.
  Choosing exactly what information to expose is the challenge.  If you lack
  sufficient foresight you might expose something which turns out to be
  very specific to just one device, so all those upper levels which make use 
  of
  the information find they are really special-casing one specific device,
  which isn't a good idea.
 
 
  However it doesn't follow that RAID5 should not be implemented in BTRFS.
  The levels that you have drawn are just one perspective.  While that has
  value, it may not be universal.
  I could easily argue that the LVM layer is a mistake and that filesystems
  should provide that functionality directly.
  I could almost argue the same for crypto.
  RAID1 can make a lot of sense to be tightly integrated with the FS.
  RAID5 ... I'm less convinced, but then I have a vested interest there so 
  that
  isn't an objective assessment.
 
  Part of the way Linux works is that s/he who writes the code gets to make
  the design decisions.   The BTRFS developers might create something truly
  awesome, or might end up having to support a RAID feature that they
  subsequently think is a bad idea.  But it really is their decision to make.
 
  NeilBrown
 
 
 One more thing to add here is that I think that we still have a chance to 
 increase the sharing between btrfs and the MD stack if we can get those 
 changes 
 made. No one likes to duplicate code, but we will need a richer interface 
 between the block and file system layer to help close that gap.
 
 Ric
 

I'm certainly open to suggestions and collaboration.  Do you have in mind any
particular way to make the interface richer??

NeilBrown


Re: Mis-Design of Btrfs?

2011-06-29 Thread Ric Wheeler

On 06/27/2011 07:46 AM, NeilBrown wrote:

On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
nico-lkml-20110...@schottelius.org  wrote:


Good morning devs,

I'm wondering whether the raid- and volume-management-builtin of btrfs is
actually a sane idea or not.
Currently we do have md/device-mapper support for raid
already, btrfs lacks raid5 support and re-implements stuff that
has already been done.

I'm aware of the fact that it is very useful to know on which devices
we are in a filesystem. But I'm wondering, whether it wouldn't be
smarter to generalise the information exposure through the VFS layer
instead of replicating functionality:

Physical:   USB-HD   SSD   USB-Flash  | Exposes information to
Raid:   Raid1, Raid5, Raid10, etc.| higher levels
Crypto: Luks  |
LVM:Groups/Volumes|
FS: xfs/jfs/reiser/ext3   v

Thus a filesystem like ext3 could be aware that it is running
on a USB HD, enable -o sync be default or have the filesystem
to rewrite blocks when running on crypto or optimise for an SSD, ...

I would certainly agree that exposing information to higher levels is a good
idea.  To some extent we do.  But it isn't always as easy as it might sound.
Choosing exactly what information to expose is the challenge.  If you lack
sufficient foresight you might expose something which turns out to be
very specific to just one device, so all those upper levels which make use of
the information find they are really special-casing one specific device,
which isn't a good idea.


However it doesn't follow that RAID5 should not be implemented in BTRFS.
The levels that you have drawn are just one perspective.  While that has
value, it may not be universal.
I could easily argue that the LVM layer is a mistake and that filesystems
should provide that functionality directly.
I could almost argue the same for crypto.
RAID1 can make a lot of sense to be tightly integrated with the FS.
RAID5 ... I'm less convinced, but then I have a vested interest there so that
isn't an objective assessment.

Part of the way Linux works is that s/he who writes the code gets to make
the design decisions.   The BTRFS developers might create something truly
awesome, or might end up having to support a RAID feature that they
subsequently think is a bad idea.  But it really is their decision to make.

NeilBrown



One more thing to add here is that I think that we still have a chance to 
increase the sharing between btrfs and the MD stack if we can get those changes 
made. No one likes to duplicate code, but we will need a richer interface 
between the block and file system layer to help close that gap.


Ric




Re: Mis-Design of Btrfs?

2011-06-29 Thread A. James Lewis
On Wed, 2011-06-29 at 10:29 +0100, Ric Wheeler wrote:
 On 06/27/2011 07:46 AM, NeilBrown wrote:
  On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
  nico-lkml-20110...@schottelius.org  wrote:
 
  Good morning devs,
 
  I'm wondering whether the raid- and volume-management-builtin of btrfs is
  actually a sane idea or not.
  Currently we do have md/device-mapper support for raid
  already, btrfs lacks raid5 support and re-implements stuff that
  has already been done.
 
  I'm aware of the fact that it is very useful to know on which devices
  we are in a filesystem. But I'm wondering, whether it wouldn't be
  smarter to generalise the information exposure through the VFS layer
  instead of replicating functionality:
 
  Physical:   USB-HD   SSD   USB-Flash  | Exposes information to
  Raid:   Raid1, Raid5, Raid10, etc.| higher levels
  Crypto: Luks  |
  LVM:Groups/Volumes|
  FS: xfs/jfs/reiser/ext3   v
 
  Thus a filesystem like ext3 could be aware that it is running
  on a USB HD, enable -o sync be default or have the filesystem
  to rewrite blocks when running on crypto or optimise for an SSD, ...
  I would certainly agree that exposing information to higher levels is a good
  idea.  To some extent we do.  But it isn't always as easy as it might sound.
  Choosing exactly what information to expose is the challenge.  If you lack
  sufficient foresight you might expose something which turns out to be
  very specific to just one device, so all those upper levels which make use 
  of
  the information find they are really special-casing one specific device,
  which isn't a good idea.
 
 
  However it doesn't follow that RAID5 should not be implemented in BTRFS.
  The levels that you have drawn are just one perspective.  While that has
  value, it may not be universal.
  I could easily argue that the LVM layer is a mistake and that filesystems
  should provide that functionality directly.
  I could almost argue the same for crypto.
  RAID1 can make a lot of sense to be tightly integrated with the FS.
  RAID5 ... I'm less convinced, but then I have a vested interest there so 
  that
  isn't an objective assessment.
 
  Part of the way Linux works is that s/he who writes the code gets to make
  the design decisions.   The BTRFS developers might create something truly
  awesome, or might end up having to support a RAID feature that they
  subsequently think is a bad idea.  But it really is their decision to make.
 
  NeilBrown
 
 
 One more thing to add here is that I think that we still have a chance to 
 increase the sharing between btrfs and the MD stack if we can get those 
 changes 
 made. No one likes to duplicate code, but we will need a richer interface 
 between the block and file system layer to help close that gap.
 
 Ric
 
Is there a possibility that one could have a 3 disk RAID5 array, and
then add a 4th disk and then do a balance, growing the RAID5 onto 4
disks and gaining the space still with RAID5?  It seems that to be
consistent, BTRFS would have to do this.

If this is the case, then I think that the BTRFS implementation of RAID5
would have to be quite different to the MD implementation.

James.

 




[nico-lkml-20110...@schottelius.org: Mis-Design of Btrfs?]

2011-06-24 Thread David Sterba
- Forwarded message from Nico Schottelius 
nico-lkml-20110...@schottelius.org -

To: LKML linux-ker...@vger.kernel.org
Date: Thu, 23 Jun 2011 12:53:37 +0200
From: Nico Schottelius nico-lkml-20110...@schottelius.org
Subject: Mis-Design of Btrfs?

Good morning devs,

I'm wondering whether the raid- and volume-management-builtin of btrfs is
actually a sane idea or not.
Currently we do have md/device-mapper support for raid
already, btrfs lacks raid5 support and re-implements stuff that
has already been done.

I'm aware of the fact that it is very useful to know on which devices
we are in a filesystem. But I'm wondering, whether it wouldn't be
smarter to generalise the information exposure through the VFS layer
instead of replicating functionality:

Physical:   USB-HD   SSD   USB-Flash  | Exposes information to
Raid:   Raid1, Raid5, Raid10, etc.| higher levels
Crypto: Luks  |
LVM:Groups/Volumes|
FS: xfs/jfs/reiser/ext3   v

Thus a filesystem like ext3 could be aware that it is running
on a USB HD, enable -o sync by default, or have the filesystem
rewrite blocks when running on crypto or optimise for an SSD, ...

Cheers,

Nico

-- 
PGP key: 7ED9 F7D3 6B10 81D7 0EC5  5C09 D7DC C8E4 3187 7DF0

- End forwarded message -