Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-02-04 Thread Anand Jain




So usually this should be functionality handled by the raid/san
controller I guess, > but given that btrfs is playing the role of a
controller here at what point are we drawing the line of not
implementing block-level functionality into the filesystem ?


    Don't worry this is not invading into the block layer. How
    can you even build this functionality in the block layer ?
    Block layer even won't know that disks are mirrored. RAID
    does or BTRFS in our case.



By block layer I guess I meant the storage driver of a particular 
raid card. Because what is currently happening is re-implementing
functionality that will generally sit in the driver. So my 
question was more generic and high-level - at what point do we draw the line of
implementing feature that are generally implemented in hardware 
devices (be it their drivers or firmware).


   Not all HW configs use RAID capable HBAs. A server connected to 
a SATA JBOD using a SATA HBA without MD will relay on BTRFS to provide all
the features and capabilities that otherwise would have provided by 
such a presumable HW config.


That does sort of sound like means implementing some portion of the
HBA features/capabilities in the filesystem.

To me it seems this this could be workable at the fs level, provided it
deals just with policies and remains hardware-neutral.





  Thanks. Ok.


However most
of the use cases appear to involve some hardware-dependent knowledge
or assumptions.



What happens when someone sets this on a virtual disk,
or say a (persistent) memory-backed block device?


  Do you have any policy in particular ?


No, this is your proposal ;^)



 Policy added here:
   devid
 It is about the devid which is assigned by the btrfs.
 Future policy:
LBA/ssd/io/Tims-heuristic
 They aren't hardware dependent though ssd says use ssd
 disk for reading if available. LBA is to divide the read
 IO access based on the sector #. The logic is quite simple
  read-sector < FS-SIZE/2 ? mirror1 : mirror2;


You've said cases #3 thru #6 are illustrative only. However they make
assumptions about the underlying storage, and/or introduce potential for
unexpected behaviors.


 The assumptions I am making is that user will understand their
 storage and tune this parameter accordingly, and there is heuristic
 (which Tim wrote) to do things automatically. Sometimes manual settings
 provide better performance than heuristic.


Plus they could end up replicating functionality
from other layers as Nikolay pointed out. Seems unlikely these would be
practical to implement.


The I/O one would actually be rather nice to have and wouldn't really be 
duplicating anything (at least, not duplicating anything we consistently 
run on top of).  The pid-based selector works fine for cases where the 
only thing on the disks is a single BTRFS filesystem.  When there's more 
than that, it can very easily result in highly asymmetrical load on the 
disks because it doesn't account for current I/O load when picking a 
copy to read.  Last I checked, both MD and DM-RAID have at least the 
option to use I/O load in determining where to send reads for RAID1 
setups, and they do a far better job than BTRFS at balancing load in 
these cases.


 Yeah.. some enterprise FS and storage communicate performance
 tunability automatically between each other. We will be there too.


Case #2 seems concerning if it exposes internal,
implementation-dependent filesystem data into a de facto user-level
interface. (Do we ensure the devid is unique, and cannot get changed or
re-assigned internally to a different device, etc?)
The devid gets assigned when a device is added to a filesystem, it's a 
monotonically increasing number that gets incremented for every new 
device, and never changes for a given device as long as it remains in 
the filesystem (it will change if you remove the device and then re-add 
it).  The only exception to this is that the replace command will assign 
the new device the same devid that the device it is replacing had (which 
I would argue leads to consistent behavior here).  Given that, I think 
it's sufficiently safe to use it for something like this.



Thanks, Anand



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-02-02 Thread Austin S. Hemmelgarn

On 2018-02-01 18:46, Edmund Nadolski wrote:



On 02/01/2018 01:12 AM, Anand Jain wrote:



On 02/01/2018 01:26 PM, Edmund Nadolski wrote:

On 1/31/18 7:36 AM, Anand Jain wrote:



On 01/31/2018 09:42 PM, Nikolay Borisov wrote:



So usually this should be functionality handled by the raid/san
controller I guess, > but given that btrfs is playing the role of a
controller here at what point are we drawing the line of not
implementing block-level functionality into the filesystem ?


Don't worry this is not invading into the block layer. How
can you even build this functionality in the block layer ?
Block layer even won't know that disks are mirrored. RAID
does or BTRFS in our case.



By block layer I guess I meant the storage driver of a particular raid
card. Because what is currently happening is re-implementing
functionality that will generally sit in the driver. So my question was
more generic and high-level - at what point do we draw the line of
implementing feature that are generally implemented in hardware devices
(be it their drivers or firmware).


   Not all HW configs use RAID capable HBAs. A server connected to a SATA
   JBOD using a SATA HBA without MD will relay on BTRFS to provide all
the
   features and capabilities that otherwise would have provided by such a
   presumable HW config.


That does sort of sound like means implementing some portion of the
HBA features/capabilities in the filesystem.

To me it seems this this could be workable at the fs level, provided it
deals just with policies and remains hardware-neutral.


  Thanks. Ok.


However most
of the use cases appear to involve some hardware-dependent knowledge
or assumptions.



What happens when someone sets this on a virtual disk,
or say a (persistent) memory-backed block device?


  Do you have any policy in particular ?


No, this is your proposal ;^)

You've said cases #3 thru #6 are illustrative only. However they make
assumptions about the underlying storage, and/or introduce potential for
unexpected behaviors. Plus they could end up replicating functionality
from other layers as Nikolay pointed out. Seems unlikely these would be
practical to implement.
The I/O one would actually be rather nice to have and wouldn't really be 
duplicating anything (at least, not duplicating anything we consistently 
run on top of).  The pid-based selector works fine for cases where the 
only thing on the disks is a single BTRFS filesystem.  When there's more 
than that, it can very easily result in highly asymmetrical load on the 
disks because it doesn't account for current I/O load when picking a 
copy to read.  Last I checked, both MD and DM-RAID have at least the 
option to use I/O load in determining where to send reads for RAID1 
setups, and they do a far better job than BTRFS at balancing load in 
these cases.


Case #2 seems concerning if it exposes internal,
implementation-dependent filesystem data into a de facto user-level
interface. (Do we ensure the devid is unique, and cannot get changed or
re-assigned internally to a different device, etc?)
The devid gets assigned when a device is added to a filesystem, it's a 
monotonically increasing number that gets incremented for every new 
device, and never changes for a given device as long as it remains in 
the filesystem (it will change if you remove the device and then re-add 
it).  The only exception to this is that the replace command will assign 
the new device the same devid that the device it is replacing had (which 
I would argue leads to consistent behavior here).  Given that, I think 
it's sufficiently safe to use it for something like this.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-02-02 Thread Edmund Nadolski


On 02/01/2018 01:12 AM, Anand Jain wrote:
> 
> 
> On 02/01/2018 01:26 PM, Edmund Nadolski wrote:
>> On 1/31/18 7:36 AM, Anand Jain wrote:
>>>
>>>
>>> On 01/31/2018 09:42 PM, Nikolay Borisov wrote:
>>>
>>>
>> So usually this should be functionality handled by the raid/san
>> controller I guess, > but given that btrfs is playing the role of a
>> controller here at what point are we drawing the line of not
>> implementing block-level functionality into the filesystem ?
>
>Don't worry this is not invading into the block layer. How
>can you even build this functionality in the block layer ?
>Block layer even won't know that disks are mirrored. RAID
>does or BTRFS in our case.
>

 By block layer I guess I meant the storage driver of a particular raid
 card. Because what is currently happening is re-implementing
 functionality that will generally sit in the driver. So my question was
 more generic and high-level - at what point do we draw the line of
 implementing feature that are generally implemented in hardware devices
 (be it their drivers or firmware).
>>>
>>>   Not all HW configs use RAID capable HBAs. A server connected to a SATA
>>>   JBOD using a SATA HBA without MD will relay on BTRFS to provide all
>>> the
>>>   features and capabilities that otherwise would have provided by such a
>>>   presumable HW config.
>>
>> That does sort of sound like means implementing some portion of the
>> HBA features/capabilities in the filesystem.
>>
>> To me it seems this this could be workable at the fs level, provided it
>> deals just with policies and remains hardware-neutral.
> 
>  Thanks. Ok.
> 
>> However most
>> of the use cases appear to involve some hardware-dependent knowledge
>> or assumptions. 
> 
>> What happens when someone sets this on a virtual disk,
>> or say a (persistent) memory-backed block device? 
> 
>  Do you have any policy in particular ?

No, this is your proposal ;^)

You've said cases #3 thru #6 are illustrative only. However they make
assumptions about the underlying storage, and/or introduce potential for
unexpected behaviors. Plus they could end up replicating functionality
from other layers as Nikolay pointed out. Seems unlikely these would be
practical to implement.

Case #2 seems concerning if it exposes internal,
implementation-dependent filesystem data into a de facto user-level
interface. (Do we ensure the devid is unique, and cannot get changed or
re-assigned internally to a different device, etc?)

Thanks,
Ed


>> Case #6 seems to
>> open up some potential for unexpected interactions (which may be hard
>> to reproduce, esp. in error/recovery scenarios).
> 
>  Yep. Even the #1 pid based (current default) which motivated
>  me to provide the devid based policy.
> 
>> Case #2 takes a devid, but I notice btrfs_device::devid says, "the
>> internal btrfs device id".  How does a user obtain that internal value
>> so it can be set as a mount option?
> 
>   btrfs fi show -m
> 
> Thanks, Anand
> 
>> Thanks,
>> Ed
>>
>>
>>> ::
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 39ba59832f38..478623e6e074 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
> btrfs_fs_info *fs_info,
>  num = map->num_stripes;
>switch(fs_info->read_mirror_policy) {
> +case BTRFS_READ_MIRROR_BY_DEV:
> +optimal = first;
> +if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
> + >stripes[optimal].dev->dev_state))
> +break;
> +if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
> + >stripes[++optimal].dev->dev_state))
> +break;
> +optimal = first;

 you set optimal 2 times, the second one seems redundant.
>>>
>>> No actually. When both the disks containing the stripe does not
>>> have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
>>> use first found stripe.
>>
>> Yes, and the fact that you've already set optimal = first right after
>> BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again
>> set
>> optimal right before the final break? What am I missing here?
>
> Ah. I think you are missing ++optimal in the 2nd if.

 You are right, but I'd prefer you index the stripes array with
 'optimal'
 and 'optimal + 1' and leave just a single assignment
>>>
>>>   Ok. Will improve that.
>>>
>>> Thanks, Anand
>>>
>>>
>
> Thanks, Anand
>
 -- 
 To unsubscribe from this list: send the line "unsubscribe
 linux-btrfs" in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

>>> -- 
>>> To unsubscribe from this list: send 

Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-02-01 Thread Anand Jain



On 02/01/2018 01:26 PM, Edmund Nadolski wrote:

On 1/31/18 7:36 AM, Anand Jain wrote:



On 01/31/2018 09:42 PM, Nikolay Borisov wrote:



So usually this should be functionality handled by the raid/san
controller I guess, > but given that btrfs is playing the role of a
controller here at what point are we drawing the line of not
implementing block-level functionality into the filesystem ?


   Don't worry this is not invading into the block layer. How
   can you even build this functionality in the block layer ?
   Block layer even won't know that disks are mirrored. RAID
   does or BTRFS in our case.



By block layer I guess I meant the storage driver of a particular raid
card. Because what is currently happening is re-implementing
functionality that will generally sit in the driver. So my question was
more generic and high-level - at what point do we draw the line of
implementing feature that are generally implemented in hardware devices
(be it their drivers or firmware).


  Not all HW configs use RAID capable HBAs. A server connected to a SATA
  JBOD using a SATA HBA without MD will relay on BTRFS to provide all the
  features and capabilities that otherwise would have provided by such a
  presumable HW config.


That does sort of sound like means implementing some portion of the
HBA features/capabilities in the filesystem.

To me it seems this this could be workable at the fs level, provided it
deals just with policies and remains hardware-neutral.


 Thanks. Ok.


However most
of the use cases appear to involve some hardware-dependent knowledge
or assumptions. 



What happens when someone sets this on a virtual disk,
or say a (persistent) memory-backed block device? 


 Do you have any policy in particular ?


Case #6 seems to
open up some potential for unexpected interactions (which may be hard
to reproduce, esp. in error/recovery scenarios).


 Yep. Even the #1 pid based (current default) which motivated
 me to provide the devid based policy.


Case #2 takes a devid, but I notice btrfs_device::devid says, "the
internal btrfs device id".  How does a user obtain that internal value
so it can be set as a mount option?


  btrfs fi show -m

Thanks, Anand


Thanks,
Ed



::

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 39ba59832f38..478623e6e074 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
btrfs_fs_info *fs_info,
     num = map->num_stripes;
       switch(fs_info->read_mirror_policy) {
+    case BTRFS_READ_MIRROR_BY_DEV:
+    optimal = first;
+    if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+ >stripes[optimal].dev->dev_state))
+    break;
+    if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+ >stripes[++optimal].dev->dev_state))
+    break;
+    optimal = first;


you set optimal 2 times, the second one seems redundant.


    No actually. When both the disks containing the stripe does not
    have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
    use first found stripe.


Yes, and the fact that you've already set optimal = first right after
BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again
set
optimal right before the final break? What am I missing here?


    Ah. I think you are missing ++optimal in the 2nd if.


You are right, but I'd prefer you index the stripes array with 'optimal'
and 'optimal + 1' and leave just a single assignment


  Ok. Will improve that.

Thanks, Anand




Thanks, Anand


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-01-31 Thread Edmund Nadolski
On 1/31/18 7:36 AM, Anand Jain wrote:
> 
> 
> On 01/31/2018 09:42 PM, Nikolay Borisov wrote:
> 
> 
 So usually this should be functionality handled by the raid/san
 controller I guess, > but given that btrfs is playing the role of a
 controller here at what point are we drawing the line of not
 implementing block-level functionality into the filesystem ?
>>>
>>>   Don't worry this is not invading into the block layer. How
>>>   can you even build this functionality in the block layer ?
>>>   Block layer even won't know that disks are mirrored. RAID
>>>   does or BTRFS in our case.
>>>
>>
>> By block layer I guess I meant the storage driver of a particular raid
>> card. Because what is currently happening is re-implementing
>> functionality that will generally sit in the driver. So my question was
>> more generic and high-level - at what point do we draw the line of
>> implementing feature that are generally implemented in hardware devices
>> (be it their drivers or firmware).
> 
>  Not all HW configs use RAID capable HBAs. A server connected to a SATA
>  JBOD using a SATA HBA without MD will relay on BTRFS to provide all the
>  features and capabilities that otherwise would have provided by such a
>  presumable HW config.

That does sort of sound like means implementing some portion of the
HBA features/capabilities in the filesystem.

To me it seems this this could be workable at the fs level, provided it
deals just with policies and remains hardware-neutral.  However most
of the use cases appear to involve some hardware-dependent knowledge
or assumptions.  What happens when someone sets this on a virtual disk,
or say a (persistent) memory-backed block device?  Case #6 seems to
open up some potential for unexpected interactions (which may be hard
to reproduce, esp. in error/recovery scenarios).

Case #2 takes a devid, but I notice btrfs_device::devid says, "the
internal btrfs device id".  How does a user obtain that internal value
so it can be set as a mount option?

Thanks,
Ed


> ::
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 39ba59832f38..478623e6e074 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
>>> btrfs_fs_info *fs_info,
>>>     num = map->num_stripes;
>>>       switch(fs_info->read_mirror_policy) {
>>> +    case BTRFS_READ_MIRROR_BY_DEV:
>>> +    optimal = first;
>>> +    if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>> + >stripes[optimal].dev->dev_state))
>>> +    break;
>>> +    if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>> + >stripes[++optimal].dev->dev_state))
>>> +    break;
>>> +    optimal = first;
>>
>> you set optimal 2 times, the second one seems redundant.
>
>    No actually. When both the disks containing the stripe does not
>    have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
>    use first found stripe.

 Yes, and the fact that you've already set optimal = first right after
 BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again
 set
 optimal right before the final break? What am I missing here?
>>>
>>>    Ah. I think you are missing ++optimal in the 2nd if.
>>
>> You are right, but I'd prefer you index the stripes array with 'optimal'
>> and 'optimal + 1' and leave just a single assignment
> 
>  Ok. Will improve that.
> 
> Thanks, Anand
> 
> 
>>>
>>> Thanks, Anand
>>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-01-31 Thread Anand Jain



On 01/31/2018 09:42 PM, Nikolay Borisov wrote:



So usually this should be functionality handled by the raid/san
controller I guess, > but given that btrfs is playing the role of a
controller here at what point are we drawing the line of not
implementing block-level functionality into the filesystem ?


  Don't worry this is not invading into the block layer. How
  can you even build this functionality in the block layer ?
  Block layer even won't know that disks are mirrored. RAID
  does or BTRFS in our case.



By block layer I guess I meant the storage driver of a particular raid
card. Because what is currently happening is re-implementing
functionality that will generally sit in the driver. So my question was
more generic and high-level - at what point do we draw the line of
implementing feature that are generally implemented in hardware devices
(be it their drivers or firmware).


 Not all HW configs use RAID capable HBAs. A server connected to a SATA
 JBOD using a SATA HBA without MD will relay on BTRFS to provide all the
 features and capabilities that otherwise would have provided by such a
 presumable HW config.

::


::

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 39ba59832f38..478623e6e074 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
btrfs_fs_info *fs_info,
    num = map->num_stripes;
      switch(fs_info->read_mirror_policy) {
+    case BTRFS_READ_MIRROR_BY_DEV:
+    optimal = first;
+    if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+ >stripes[optimal].dev->dev_state))
+    break;
+    if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+ >stripes[++optimal].dev->dev_state))
+    break;
+    optimal = first;


you set optimal 2 times, the second one seems redundant.


   No actually. When both the disks containing the stripe does not
   have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
   use first found stripe.


Yes, and the fact that you've already set optimal = first right after
BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set
optimal right before the final break? What am I missing here?


   Ah. I think you are missing ++optimal in the 2nd if.


You are right, but I'd prefer you index the stripes array with 'optimal'
and 'optimal + 1' and leave just a single assignment


 Ok. Will improve that.

Thanks, Anand




Thanks, Anand


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-01-31 Thread Nikolay Borisov


On 31.01.2018 15:38, Anand Jain wrote:
> 
> 
> On 01/31/2018 05:54 PM, Nikolay Borisov wrote:
>>
>>
>> On 31.01.2018 11:28, Anand Jain wrote:
>>>
>>>
>>> On 01/31/2018 04:38 PM, Nikolay Borisov wrote:


 On 30.01.2018 08:30, Anand Jain wrote:
> Adds the mount option:
>     mount -o read_mirror_policy=
>
> To set the devid of the device which should be used for read. That
> means all the normal reads will go to that particular device only.
>
> This also helps testing and gives a better control for the test
> scripts including mount context reads.

 Some code comments below. OTOH, does such policy really make sense,
 what
 happens if the selected device fails, will the other mirror be retried?
>>>
>>>   Everything as usual, read_mirror_policy=devid just lets the user to
>>>   specify his read optimized disk, so that we don't depend on the pid
>>>   to pick a stripe mirrored disk, and instead we would pick as suggested
>>>   by the user, and if that disk fails then we go back to the other
>>> mirror
>>>   which may not be the read optimized disk as we have no other choice.
>>>
 If the answer to the previous question is positive then why do we
 really
 care which device is going to be tried first?
>>>
>>>   It matters.
>>>     - If you are reading from both disks alternatively, then it
>>>   duplicates the LUN cache on the storage.
>>>     - Some disks are read-optimized and using that for reading and going
>>>   back to the other disk only when this disk fails provides a better
>>>   overall read performance.
>>
>> So usually this should be functionality handled by the raid/san
>> controller I guess, > but given that btrfs is playing the role of a
>> controller here at what point are we drawing the line of not
>> implementing block-level functionality into the filesystem ?
> 
>  Don't worry this is not invading into the block layer. How
>  can you even build this functionality in the block layer ?
>  Block layer even won't know that disks are mirrored. RAID
>  does or BTRFS in our case.
> 

By block layer I guess I meant the storage driver of a particular raid
card. Because what is currently happening is re-implementing
functionality that will generally sit in the driver. So my question was
more generic and high-level - at what point do we draw the line of
implementing feature that are generally implemented in hardware devices
(be it their drivers or firmware).

>>> ::
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 39ba59832f38..478623e6e074 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
> btrfs_fs_info *fs_info,
>    num = map->num_stripes;
>      switch(fs_info->read_mirror_policy) {
> +    case BTRFS_READ_MIRROR_BY_DEV:
> +    optimal = first;
> +    if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
> + >stripes[optimal].dev->dev_state))
> +    break;
> +    if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
> + >stripes[++optimal].dev->dev_state))
> +    break;
> +    optimal = first;

 you set optimal 2 times, the second one seems redundant.
>>>
>>>   No actually. When both the disks containing the stripe does not
>>>   have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
>>>   use first found stripe.
>>
>> Yes, and the fact that you've already set optimal = first right after
>> BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set
>> optimal right before the final break? What am I missing here?
> 
>   Ah. I think you are missing ++optimal in the 2nd if.

You are right, but I'd prefer you index the stripes array with 'optimal'
and 'optimal + 1' and leave just a single assignment

> 
> Thanks, Anand
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-01-31 Thread Anand Jain



On 01/31/2018 05:54 PM, Nikolay Borisov wrote:



On 31.01.2018 11:28, Anand Jain wrote:



On 01/31/2018 04:38 PM, Nikolay Borisov wrote:



On 30.01.2018 08:30, Anand Jain wrote:

Adds the mount option:
    mount -o read_mirror_policy=

To set the devid of the device which should be used for read. That
means all the normal reads will go to that particular device only.

This also helps testing and gives a better control for the test
scripts including mount context reads.


Some code comments below. OTOH, does such policy really make sense, what
happens if the selected device fails, will the other mirror be retried?


  Everything as usual, read_mirror_policy=devid just lets the user to
  specify his read optimized disk, so that we don't depend on the pid
  to pick a stripe mirrored disk, and instead we would pick as suggested
  by the user, and if that disk fails then we go back to the other mirror
  which may not be the read optimized disk as we have no other choice.


If the answer to the previous question is positive then why do we really
care which device is going to be tried first?


  It matters.
    - If you are reading from both disks alternatively, then it
  duplicates the LUN cache on the storage.
    - Some disks are read-optimized and using that for reading and going
  back to the other disk only when this disk fails provides a better
  overall read performance.


So usually this should be functionality handled by the raid/san
controller I guess, > but given that btrfs is playing the role of a
controller here at what point are we drawing the line of not
implementing block-level functionality into the filesystem ?


 Don't worry this is not invading into the block layer. How
 can you even build this functionality in the block layer ?
 Block layer even won't know that disks are mirrored. RAID
 does or BTRFS in our case.


::

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 39ba59832f38..478623e6e074 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
btrfs_fs_info *fs_info,
   num = map->num_stripes;
     switch(fs_info->read_mirror_policy) {
+    case BTRFS_READ_MIRROR_BY_DEV:
+    optimal = first;
+    if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+ >stripes[optimal].dev->dev_state))
+    break;
+    if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+ >stripes[++optimal].dev->dev_state))
+    break;
+    optimal = first;


you set optimal 2 times, the second one seems redundant.


  No actually. When both the disks containing the stripe does not
  have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
  use first found stripe.


Yes, and the fact that you've already set optimal = first right after
BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set
optimal right before the final break? What am I missing here?


  Ah. I think you are missing ++optimal in the 2nd if.

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-01-31 Thread Anand Jain



On 01/31/2018 04:38 PM, Nikolay Borisov wrote:



On 30.01.2018 08:30, Anand Jain wrote:

Adds the mount option:
   mount -o read_mirror_policy=

To set the devid of the device which should be used for read. That
means all the normal reads will go to that particular device only.

This also helps testing and gives a better control for the test
scripts including mount context reads.


Some code comments below. OTOH, does such policy really make sense, what
happens if the selected device fails, will the other mirror be retried?


 Everything as usual, read_mirror_policy=devid just lets the user to
 specify his read optimized disk, so that we don't depend on the pid
 to pick a stripe mirrored disk, and instead we would pick as suggested
 by the user, and if that disk fails then we go back to the other mirror
 which may not be the read optimized disk as we have no other choice.


If the answer to the previous question is positive then why do we really
care which device is going to be tried first?


 It matters.
   - If you are reading from both disks alternatively, then it
 duplicates the LUN cache on the storage.
   - Some disks are read-optimized and using that for reading and going
 back to the other disk only when this disk fails provides a better
 overall read performance.

::

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 39ba59832f38..478623e6e074 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info 
*fs_info,
num = map->num_stripes;
  
  	switch(fs_info->read_mirror_policy) {

+   case BTRFS_READ_MIRROR_BY_DEV:
+   optimal = first;
+   if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+>stripes[optimal].dev->dev_state))
+   break;
+   if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+>stripes[++optimal].dev->dev_state))
+   break;
+   optimal = first;


you set optimal 2 times, the second one seems redundant.


 No actually. When both the disks containing the stripe does not
 have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
 use first found stripe.


Alongside this
patch it makes sense to also send a patch to btrfs(5) man page
describing the mount option + description of each implemented allocation
policy.


 Yep. Will do.


Another thing which I don't see here is how you are handling the case
when you have more than 2 devices in the RAID1 case. As it stands
currently you assume there are two devices and first test device 0 and
then device 1 and completely ignore any other devices.


 Not really. That part is already handled by the extent mapping.
 As the number of stripe for raid1 is two, the extent mapping will
 manage put related two devices of this stripe.

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-01-31 Thread Nikolay Borisov


On 30.01.2018 08:30, Anand Jain wrote:
> Adds the mount option:
>   mount -o read_mirror_policy=
> 
> To set the devid of the device which should be used for read. That
> means all the normal reads will go to that particular device only.
> 
> This also helps testing and gives a better control for the test
> scripts including mount context reads.

Some code comments below. OTOH, does such policy really make sense, what
happens if the selected device fails, will the other mirror be retried?
If the answer to the previous question is positive then why do we really
care which device is going to be tried first?

> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/super.c   | 21 +
>  fs/btrfs/volumes.c | 10 ++
>  fs/btrfs/volumes.h |  2 ++
>  3 files changed, 33 insertions(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index dfe6b3c67df3..d3aad87e 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -847,6 +847,27 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
> *options,
>   BTRFS_READ_MIRROR_BY_PID;
>   break;
>   }
> +
> + intarg = 0;
> + if (match_int([0], ) == 0) {
> + struct btrfs_device *device;
> +
> + device = btrfs_find_device(info, intarg,
> +NULL, NULL);
> + if (!device) {
> + btrfs_err(info,
> +   "read_mirror_policy: invalid devid 
> %d",
> +   intarg);
> + ret = -EINVAL;
> + goto out;
> + }
> + info->read_mirror_policy =
> + BTRFS_READ_MIRROR_BY_DEV;
> + set_bit(BTRFS_DEV_STATE_READ_MIRROR,
> + >dev_state);
> + break;
> + }
> +
>   ret = -EINVAL;
>   goto out;
>   case Opt_err:
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 39ba59832f38..478623e6e074 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info 
> *fs_info,
>   num = map->num_stripes;
>  
>   switch(fs_info->read_mirror_policy) {
> + case BTRFS_READ_MIRROR_BY_DEV:
> + optimal = first;
> + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
> +  >stripes[optimal].dev->dev_state))
> + break;
> + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
> +  >stripes[++optimal].dev->dev_state))
> + break;
> + optimal = first;

you set optimal 2 times, the second one seems redundant. Alongside this
patch it makes sense to also send a patch to btrfs(5) man page
describing the mount option + description of each implemented allocation
policy.

Another thing which I don't see here is how you are handling the case
when you have more than 2 devices in the RAID1 case. As it stands
currently you assume there are two devices and first test device 0 and
then device 1 and completely ignore any other devices.

> + break;
>   case BTRFS_READ_MIRROR_DEFAULT:
>   case BTRFS_READ_MIRROR_BY_PID:
>   default:
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 78f35d299a61..7281f55dea05 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -50,6 +50,7 @@ struct btrfs_pending_bios {
>  enum btrfs_read_mirror_type {
>   BTRFS_READ_MIRROR_DEFAULT,
>   BTRFS_READ_MIRROR_BY_PID,
> + BTRFS_READ_MIRROR_BY_DEV,
>  };
>  
>  #define BTRFS_DEV_STATE_WRITEABLE(0)
> @@ -57,6 +58,7 @@ enum btrfs_read_mirror_type {
>  #define BTRFS_DEV_STATE_MISSING  (2)
>  #define BTRFS_DEV_STATE_REPLACE_TGT  (3)
>  #define BTRFS_DEV_STATE_FLUSH_SENT   (4)
> +#define BTRFS_DEV_STATE_READ_MIRROR  (5)
>  
>  struct btrfs_device {
>   struct list_head dev_list;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: add read_mirror_policy parameter devid

2018-01-29 Thread Anand Jain
Adds the mount option:
  mount -o read_mirror_policy=

To set the devid of the device which should be used for read. That
means all the normal reads will go to that particular device only.

This also helps testing and gives a better control for the test
scripts including mount context reads.

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c   | 21 +
 fs/btrfs/volumes.c | 10 ++
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 33 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index dfe6b3c67df3..d3aad87e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -847,6 +847,27 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
*options,
BTRFS_READ_MIRROR_BY_PID;
break;
}
+
+   intarg = 0;
+   if (match_int([0], ) == 0) {
+   struct btrfs_device *device;
+
+   device = btrfs_find_device(info, intarg,
+  NULL, NULL);
+   if (!device) {
+   btrfs_err(info,
+ "read_mirror_policy: invalid devid 
%d",
+ intarg);
+   ret = -EINVAL;
+   goto out;
+   }
+   info->read_mirror_policy =
+   BTRFS_READ_MIRROR_BY_DEV;
+   set_bit(BTRFS_DEV_STATE_READ_MIRROR,
+   >dev_state);
+   break;
+   }
+
ret = -EINVAL;
goto out;
case Opt_err:
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 39ba59832f38..478623e6e074 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info 
*fs_info,
num = map->num_stripes;
 
switch(fs_info->read_mirror_policy) {
+   case BTRFS_READ_MIRROR_BY_DEV:
+   optimal = first;
+   if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+>stripes[optimal].dev->dev_state))
+   break;
+   if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+>stripes[++optimal].dev->dev_state))
+   break;
+   optimal = first;
+   break;
case BTRFS_READ_MIRROR_DEFAULT:
case BTRFS_READ_MIRROR_BY_PID:
default:
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 78f35d299a61..7281f55dea05 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -50,6 +50,7 @@ struct btrfs_pending_bios {
 enum btrfs_read_mirror_type {
BTRFS_READ_MIRROR_DEFAULT,
BTRFS_READ_MIRROR_BY_PID,
+   BTRFS_READ_MIRROR_BY_DEV,
 };
 
 #define BTRFS_DEV_STATE_WRITEABLE  (0)
@@ -57,6 +58,7 @@ enum btrfs_read_mirror_type {
 #define BTRFS_DEV_STATE_MISSING(2)
 #define BTRFS_DEV_STATE_REPLACE_TGT(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT (4)
+#define BTRFS_DEV_STATE_READ_MIRROR(5)
 
 struct btrfs_device {
struct list_head dev_list;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html