Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
So usually this should be functionality handled by the raid/san controller I guess, > but given that btrfs is playing the role of a controller here at what point are we drawing the line of not implementing block-level functionality into the filesystem ? Don't worry this is not invading into the block layer. How can you even build this functionality in the block layer ? Block layer even won't know that disks are mirrored. RAID does or BTRFS in our case. By block layer I guess I meant the storage driver of a particular raid card. Because what is currently happening is re-implementing functionality that will generally sit in the driver. So my question was more generic and high-level - at what point do we draw the line of implementing feature that are generally implemented in hardware devices (be it their drivers or firmware). Not all HW configs use RAID capable HBAs. A server connected to a SATA JBOD using a SATA HBA without MD will relay on BTRFS to provide all the features and capabilities that otherwise would have provided by such a presumable HW config. That does sort of sound like means implementing some portion of the HBA features/capabilities in the filesystem. To me it seems this this could be workable at the fs level, provided it deals just with policies and remains hardware-neutral. Thanks. Ok. However most of the use cases appear to involve some hardware-dependent knowledge or assumptions. What happens when someone sets this on a virtual disk, or say a (persistent) memory-backed block device? Do you have any policy in particular ? No, this is your proposal ;^) Policy added here: devid It is about the devid which is assigned by the btrfs. Future policy: LBA/ssd/io/Tims-heuristic They aren't hardware dependent though ssd says use ssd disk for reading if available. LBA is to divide the read IO access based on the sector #. The logic is quite simple read-sector < FS-SIZE/2 ? mirror1 : mirror2; You've said cases #3 thru #6 are illustrative only. However they make assumptions about the underlying storage, and/or introduce potential for unexpected behaviors. The assumptions I am making is that user will understand their storage and tune this parameter accordingly, and there is heuristic (which Tim wrote) to do things automatically. Sometimes manual settings provide better performance than heuristic. Plus they could end up replicating functionality from other layers as Nikolay pointed out. Seems unlikely these would be practical to implement. The I/O one would actually be rather nice to have and wouldn't really be duplicating anything (at least, not duplicating anything we consistently run on top of). The pid-based selector works fine for cases where the only thing on the disks is a single BTRFS filesystem. When there's more than that, it can very easily result in highly asymmetrical load on the disks because it doesn't account for current I/O load when picking a copy to read. Last I checked, both MD and DM-RAID have at least the option to use I/O load in determining where to send reads for RAID1 setups, and they do a far better job than BTRFS at balancing load in these cases. Yeah.. some enterprise FS and storage communicate performance tunability automatically between each other. We will be there too. Case #2 seems concerning if it exposes internal, implementation-dependent filesystem data into a de facto user-level interface. (Do we ensure the devid is unique, and cannot get changed or re-assigned internally to a different device, etc?) The devid gets assigned when a device is added to a filesystem, it's a monotonically increasing number that gets incremented for every new device, and never changes for a given device as long as it remains in the filesystem (it will change if you remove the device and then re-add it). The only exception to this is that the replace command will assign the new device the same devid that the device it is replacing had (which I would argue leads to consistent behavior here). Given that, I think it's sufficiently safe to use it for something like this. Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
On 2018-02-01 18:46, Edmund Nadolski wrote: On 02/01/2018 01:12 AM, Anand Jain wrote: On 02/01/2018 01:26 PM, Edmund Nadolski wrote: On 1/31/18 7:36 AM, Anand Jain wrote: On 01/31/2018 09:42 PM, Nikolay Borisov wrote: So usually this should be functionality handled by the raid/san controller I guess, > but given that btrfs is playing the role of a controller here at what point are we drawing the line of not implementing block-level functionality into the filesystem ? Don't worry this is not invading into the block layer. How can you even build this functionality in the block layer ? Block layer even won't know that disks are mirrored. RAID does or BTRFS in our case. By block layer I guess I meant the storage driver of a particular raid card. Because what is currently happening is re-implementing functionality that will generally sit in the driver. So my question was more generic and high-level - at what point do we draw the line of implementing feature that are generally implemented in hardware devices (be it their drivers or firmware). Not all HW configs use RAID capable HBAs. A server connected to a SATA JBOD using a SATA HBA without MD will relay on BTRFS to provide all the features and capabilities that otherwise would have provided by such a presumable HW config. That does sort of sound like means implementing some portion of the HBA features/capabilities in the filesystem. To me it seems this this could be workable at the fs level, provided it deals just with policies and remains hardware-neutral. Thanks. Ok. However most of the use cases appear to involve some hardware-dependent knowledge or assumptions. What happens when someone sets this on a virtual disk, or say a (persistent) memory-backed block device? Do you have any policy in particular ? No, this is your proposal ;^) You've said cases #3 thru #6 are illustrative only. However they make assumptions about the underlying storage, and/or introduce potential for unexpected behaviors. Plus they could end up replicating functionality from other layers as Nikolay pointed out. Seems unlikely these would be practical to implement. The I/O one would actually be rather nice to have and wouldn't really be duplicating anything (at least, not duplicating anything we consistently run on top of). The pid-based selector works fine for cases where the only thing on the disks is a single BTRFS filesystem. When there's more than that, it can very easily result in highly asymmetrical load on the disks because it doesn't account for current I/O load when picking a copy to read. Last I checked, both MD and DM-RAID have at least the option to use I/O load in determining where to send reads for RAID1 setups, and they do a far better job than BTRFS at balancing load in these cases. Case #2 seems concerning if it exposes internal, implementation-dependent filesystem data into a de facto user-level interface. (Do we ensure the devid is unique, and cannot get changed or re-assigned internally to a different device, etc?) The devid gets assigned when a device is added to a filesystem, it's a monotonically increasing number that gets incremented for every new device, and never changes for a given device as long as it remains in the filesystem (it will change if you remove the device and then re-add it). The only exception to this is that the replace command will assign the new device the same devid that the device it is replacing had (which I would argue leads to consistent behavior here). Given that, I think it's sufficiently safe to use it for something like this. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
On 02/01/2018 01:12 AM, Anand Jain wrote: > > > On 02/01/2018 01:26 PM, Edmund Nadolski wrote: >> On 1/31/18 7:36 AM, Anand Jain wrote: >>> >>> >>> On 01/31/2018 09:42 PM, Nikolay Borisov wrote: >>> >>> >> So usually this should be functionality handled by the raid/san >> controller I guess, > but given that btrfs is playing the role of a >> controller here at what point are we drawing the line of not >> implementing block-level functionality into the filesystem ? > >Don't worry this is not invading into the block layer. How >can you even build this functionality in the block layer ? >Block layer even won't know that disks are mirrored. RAID >does or BTRFS in our case. > By block layer I guess I meant the storage driver of a particular raid card. Because what is currently happening is re-implementing functionality that will generally sit in the driver. So my question was more generic and high-level - at what point do we draw the line of implementing feature that are generally implemented in hardware devices (be it their drivers or firmware). >>> >>> Not all HW configs use RAID capable HBAs. A server connected to a SATA >>> JBOD using a SATA HBA without MD will relay on BTRFS to provide all >>> the >>> features and capabilities that otherwise would have provided by such a >>> presumable HW config. >> >> That does sort of sound like means implementing some portion of the >> HBA features/capabilities in the filesystem. >> >> To me it seems this this could be workable at the fs level, provided it >> deals just with policies and remains hardware-neutral. > > Thanks. Ok. > >> However most >> of the use cases appear to involve some hardware-dependent knowledge >> or assumptions. > >> What happens when someone sets this on a virtual disk, >> or say a (persistent) memory-backed block device? > > Do you have any policy in particular ? No, this is your proposal ;^) You've said cases #3 thru #6 are illustrative only. However they make assumptions about the underlying storage, and/or introduce potential for unexpected behaviors. Plus they could end up replicating functionality from other layers as Nikolay pointed out. Seems unlikely these would be practical to implement. Case #2 seems concerning if it exposes internal, implementation-dependent filesystem data into a de facto user-level interface. (Do we ensure the devid is unique, and cannot get changed or re-assigned internally to a different device, etc?) Thanks, Ed >> Case #6 seems to >> open up some potential for unexpected interactions (which may be hard >> to reproduce, esp. in error/recovery scenarios). > > Yep. Even the #1 pid based (current default) which motivated > me to provide the devid based policy. > >> Case #2 takes a devid, but I notice btrfs_device::devid says, "the >> internal btrfs device id". How does a user obtain that internal value >> so it can be set as a mount option? > > btrfs fi show -m > > Thanks, Anand > >> Thanks, >> Ed >> >> >>> :: > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 39ba59832f38..478623e6e074 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct > btrfs_fs_info *fs_info, > num = map->num_stripes; >switch(fs_info->read_mirror_policy) { > +case BTRFS_READ_MIRROR_BY_DEV: > +optimal = first; > +if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, > + >stripes[optimal].dev->dev_state)) > +break; > +if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, > + >stripes[++optimal].dev->dev_state)) > +break; > +optimal = first; you set optimal 2 times, the second one seems redundant. >>> >>> No actually. When both the disks containing the stripe does not >>> have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to >>> use first found stripe. >> >> Yes, and the fact that you've already set optimal = first right after >> BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again >> set >> optimal right before the final break? What am I missing here? > > Ah. I think you are missing ++optimal in the 2nd if. You are right, but I'd prefer you index the stripes array with 'optimal' and 'optimal + 1' and leave just a single assignment >>> >>> Ok. Will improve that. >>> >>> Thanks, Anand >>> >>> > > Thanks, Anand > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html >>> -- >>> To unsubscribe from this list: send
Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
On 02/01/2018 01:26 PM, Edmund Nadolski wrote: On 1/31/18 7:36 AM, Anand Jain wrote: On 01/31/2018 09:42 PM, Nikolay Borisov wrote: So usually this should be functionality handled by the raid/san controller I guess, > but given that btrfs is playing the role of a controller here at what point are we drawing the line of not implementing block-level functionality into the filesystem ? Don't worry this is not invading into the block layer. How can you even build this functionality in the block layer ? Block layer even won't know that disks are mirrored. RAID does or BTRFS in our case. By block layer I guess I meant the storage driver of a particular raid card. Because what is currently happening is re-implementing functionality that will generally sit in the driver. So my question was more generic and high-level - at what point do we draw the line of implementing feature that are generally implemented in hardware devices (be it their drivers or firmware). Not all HW configs use RAID capable HBAs. A server connected to a SATA JBOD using a SATA HBA without MD will relay on BTRFS to provide all the features and capabilities that otherwise would have provided by such a presumable HW config. That does sort of sound like means implementing some portion of the HBA features/capabilities in the filesystem. To me it seems this this could be workable at the fs level, provided it deals just with policies and remains hardware-neutral. Thanks. Ok. However most of the use cases appear to involve some hardware-dependent knowledge or assumptions. What happens when someone sets this on a virtual disk, or say a (persistent) memory-backed block device? Do you have any policy in particular ? Case #6 seems to open up some potential for unexpected interactions (which may be hard to reproduce, esp. in error/recovery scenarios). Yep. Even the #1 pid based (current default) which motivated me to provide the devid based policy. Case #2 takes a devid, but I notice btrfs_device::devid says, "the internal btrfs device id". How does a user obtain that internal value so it can be set as a mount option? btrfs fi show -m Thanks, Anand Thanks, Ed :: diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 39ba59832f38..478623e6e074 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info, num = map->num_stripes; switch(fs_info->read_mirror_policy) { + case BTRFS_READ_MIRROR_BY_DEV: + optimal = first; + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, + >stripes[optimal].dev->dev_state)) + break; + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, + >stripes[++optimal].dev->dev_state)) + break; + optimal = first; you set optimal 2 times, the second one seems redundant. No actually. When both the disks containing the stripe does not have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to use first found stripe. Yes, and the fact that you've already set optimal = first right after BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set optimal right before the final break? What am I missing here? Ah. I think you are missing ++optimal in the 2nd if. You are right, but I'd prefer you index the stripes array with 'optimal' and 'optimal + 1' and leave just a single assignment Ok. Will improve that. Thanks, Anand Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
On 1/31/18 7:36 AM, Anand Jain wrote: > > > On 01/31/2018 09:42 PM, Nikolay Borisov wrote: > > So usually this should be functionality handled by the raid/san controller I guess, > but given that btrfs is playing the role of a controller here at what point are we drawing the line of not implementing block-level functionality into the filesystem ? >>> >>> Don't worry this is not invading into the block layer. How >>> can you even build this functionality in the block layer ? >>> Block layer even won't know that disks are mirrored. RAID >>> does or BTRFS in our case. >>> >> >> By block layer I guess I meant the storage driver of a particular raid >> card. Because what is currently happening is re-implementing >> functionality that will generally sit in the driver. So my question was >> more generic and high-level - at what point do we draw the line of >> implementing feature that are generally implemented in hardware devices >> (be it their drivers or firmware). > > Not all HW configs use RAID capable HBAs. A server connected to a SATA > JBOD using a SATA HBA without MD will relay on BTRFS to provide all the > features and capabilities that otherwise would have provided by such a > presumable HW config. That does sort of sound like means implementing some portion of the HBA features/capabilities in the filesystem. To me it seems this this could be workable at the fs level, provided it deals just with policies and remains hardware-neutral. However most of the use cases appear to involve some hardware-dependent knowledge or assumptions. What happens when someone sets this on a virtual disk, or say a (persistent) memory-backed block device? Case #6 seems to open up some potential for unexpected interactions (which may be hard to reproduce, esp. in error/recovery scenarios). Case #2 takes a devid, but I notice btrfs_device::devid says, "the internal btrfs device id". How does a user obtain that internal value so it can be set as a mount option? Thanks, Ed > :: >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >>> index 39ba59832f38..478623e6e074 100644 >>> --- a/fs/btrfs/volumes.c >>> +++ b/fs/btrfs/volumes.c >>> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct >>> btrfs_fs_info *fs_info, >>> num = map->num_stripes; >>> switch(fs_info->read_mirror_policy) { >>> + case BTRFS_READ_MIRROR_BY_DEV: >>> + optimal = first; >>> + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, >>> + >stripes[optimal].dev->dev_state)) >>> + break; >>> + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, >>> + >stripes[++optimal].dev->dev_state)) >>> + break; >>> + optimal = first; >> >> you set optimal 2 times, the second one seems redundant. > > No actually. When both the disks containing the stripe does not > have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to > use first found stripe. Yes, and the fact that you've already set optimal = first right after BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set optimal right before the final break? What am I missing here? >>> >>> Ah. I think you are missing ++optimal in the 2nd if. >> >> You are right, but I'd prefer you index the stripes array with 'optimal' >> and 'optimal + 1' and leave just a single assignment > > Ok. Will improve that. > > Thanks, Anand > > >>> >>> Thanks, Anand >>> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
On 01/31/2018 09:42 PM, Nikolay Borisov wrote: So usually this should be functionality handled by the raid/san controller I guess, > but given that btrfs is playing the role of a controller here at what point are we drawing the line of not implementing block-level functionality into the filesystem ? Don't worry this is not invading into the block layer. How can you even build this functionality in the block layer ? Block layer even won't know that disks are mirrored. RAID does or BTRFS in our case. By block layer I guess I meant the storage driver of a particular raid card. Because what is currently happening is re-implementing functionality that will generally sit in the driver. So my question was more generic and high-level - at what point do we draw the line of implementing feature that are generally implemented in hardware devices (be it their drivers or firmware). Not all HW configs use RAID capable HBAs. A server connected to a SATA JBOD using a SATA HBA without MD will relay on BTRFS to provide all the features and capabilities that otherwise would have provided by such a presumable HW config. :: :: diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 39ba59832f38..478623e6e074 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info, num = map->num_stripes; switch(fs_info->read_mirror_policy) { + case BTRFS_READ_MIRROR_BY_DEV: + optimal = first; + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, + >stripes[optimal].dev->dev_state)) + break; + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, + >stripes[++optimal].dev->dev_state)) + break; + optimal = first; you set optimal 2 times, the second one seems redundant. No actually. When both the disks containing the stripe does not have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to use first found stripe. Yes, and the fact that you've already set optimal = first right after BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set optimal right before the final break? What am I missing here? Ah. I think you are missing ++optimal in the 2nd if. You are right, but I'd prefer you index the stripes array with 'optimal' and 'optimal + 1' and leave just a single assignment Ok. Will improve that. Thanks, Anand Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
On 31.01.2018 15:38, Anand Jain wrote: > > > On 01/31/2018 05:54 PM, Nikolay Borisov wrote: >> >> >> On 31.01.2018 11:28, Anand Jain wrote: >>> >>> >>> On 01/31/2018 04:38 PM, Nikolay Borisov wrote: On 30.01.2018 08:30, Anand Jain wrote: > Adds the mount option: > mount -o read_mirror_policy= > > To set the devid of the device which should be used for read. That > means all the normal reads will go to that particular device only. > > This also helps testing and gives a better control for the test > scripts including mount context reads. Some code comments below. OTOH, does such policy really make sense, what happens if the selected device fails, will the other mirror be retried? >>> >>> Everything as usual, read_mirror_policy=devid just lets the user to >>> specify his read optimized disk, so that we don't depend on the pid >>> to pick a stripe mirrored disk, and instead we would pick as suggested >>> by the user, and if that disk fails then we go back to the other >>> mirror >>> which may not be the read optimized disk as we have no other choice. >>> If the answer to the previous question is positive then why do we really care which device is going to be tried first? >>> >>> It matters. >>> - If you are reading from both disks alternatively, then it >>> duplicates the LUN cache on the storage. >>> - Some disks are read-optimized and using that for reading and going >>> back to the other disk only when this disk fails provides a better >>> overall read performance. >> >> So usually this should be functionality handled by the raid/san >> controller I guess, > but given that btrfs is playing the role of a >> controller here at what point are we drawing the line of not >> implementing block-level functionality into the filesystem ? > > Don't worry this is not invading into the block layer. How > can you even build this functionality in the block layer ? > Block layer even won't know that disks are mirrored. RAID > does or BTRFS in our case. > By block layer I guess I meant the storage driver of a particular raid card. Because what is currently happening is re-implementing functionality that will generally sit in the driver. So my question was more generic and high-level - at what point do we draw the line of implementing feature that are generally implemented in hardware devices (be it their drivers or firmware). >>> :: > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 39ba59832f38..478623e6e074 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct > btrfs_fs_info *fs_info, > num = map->num_stripes; > switch(fs_info->read_mirror_policy) { > + case BTRFS_READ_MIRROR_BY_DEV: > + optimal = first; > + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, > + >stripes[optimal].dev->dev_state)) > + break; > + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, > + >stripes[++optimal].dev->dev_state)) > + break; > + optimal = first; you set optimal 2 times, the second one seems redundant. >>> >>> No actually. When both the disks containing the stripe does not >>> have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to >>> use first found stripe. >> >> Yes, and the fact that you've already set optimal = first right after >> BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set >> optimal right before the final break? What am I missing here? > > Ah. I think you are missing ++optimal in the 2nd if. You are right, but I'd prefer you index the stripes array with 'optimal' and 'optimal + 1' and leave just a single assignment > > Thanks, Anand > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
On 01/31/2018 05:54 PM, Nikolay Borisov wrote: On 31.01.2018 11:28, Anand Jain wrote: On 01/31/2018 04:38 PM, Nikolay Borisov wrote: On 30.01.2018 08:30, Anand Jain wrote: Adds the mount option: mount -o read_mirror_policy= To set the devid of the device which should be used for read. That means all the normal reads will go to that particular device only. This also helps testing and gives a better control for the test scripts including mount context reads. Some code comments below. OTOH, does such policy really make sense, what happens if the selected device fails, will the other mirror be retried? Everything as usual, read_mirror_policy=devid just lets the user to specify his read optimized disk, so that we don't depend on the pid to pick a stripe mirrored disk, and instead we would pick as suggested by the user, and if that disk fails then we go back to the other mirror which may not be the read optimized disk as we have no other choice. If the answer to the previous question is positive then why do we really care which device is going to be tried first? It matters. - If you are reading from both disks alternatively, then it duplicates the LUN cache on the storage. - Some disks are read-optimized and using that for reading and going back to the other disk only when this disk fails provides a better overall read performance. So usually this should be functionality handled by the raid/san controller I guess, > but given that btrfs is playing the role of a controller here at what point are we drawing the line of not implementing block-level functionality into the filesystem ? Don't worry this is not invading into the block layer. How can you even build this functionality in the block layer ? Block layer even won't know that disks are mirrored. RAID does or BTRFS in our case. :: diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 39ba59832f38..478623e6e074 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info, num = map->num_stripes; switch(fs_info->read_mirror_policy) { + case BTRFS_READ_MIRROR_BY_DEV: + optimal = first; + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, + >stripes[optimal].dev->dev_state)) + break; + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, + >stripes[++optimal].dev->dev_state)) + break; + optimal = first; you set optimal 2 times, the second one seems redundant. No actually. When both the disks containing the stripe does not have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to use first found stripe. Yes, and the fact that you've already set optimal = first right after BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set optimal right before the final break? What am I missing here? Ah. I think you are missing ++optimal in the 2nd if. Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
On 01/31/2018 04:38 PM, Nikolay Borisov wrote: On 30.01.2018 08:30, Anand Jain wrote: Adds the mount option: mount -o read_mirror_policy= To set the devid of the device which should be used for read. That means all the normal reads will go to that particular device only. This also helps testing and gives a better control for the test scripts including mount context reads. Some code comments below. OTOH, does such policy really make sense, what happens if the selected device fails, will the other mirror be retried? Everything as usual, read_mirror_policy=devid just lets the user to specify his read optimized disk, so that we don't depend on the pid to pick a stripe mirrored disk, and instead we would pick as suggested by the user, and if that disk fails then we go back to the other mirror which may not be the read optimized disk as we have no other choice. If the answer to the previous question is positive then why do we really care which device is going to be tried first? It matters. - If you are reading from both disks alternatively, then it duplicates the LUN cache on the storage. - Some disks are read-optimized and using that for reading and going back to the other disk only when this disk fails provides a better overall read performance. :: diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 39ba59832f38..478623e6e074 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info, num = map->num_stripes; switch(fs_info->read_mirror_policy) { + case BTRFS_READ_MIRROR_BY_DEV: + optimal = first; + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, +>stripes[optimal].dev->dev_state)) + break; + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, +>stripes[++optimal].dev->dev_state)) + break; + optimal = first; you set optimal 2 times, the second one seems redundant. No actually. When both the disks containing the stripe does not have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to use first found stripe. Alongside this patch it makes sense to also send a patch to btrfs(5) man page describing the mount option + description of each implemented allocation policy. Yep. Will do. Another thing which I don't see here is how you are handling the case when you have more than 2 devices in the RAID1 case. As it stands currently you assume there are two devices and first test device 0 and then device 1 and completely ignore any other devices. Not really. That part is already handled by the extent mapping. As the number of stripe for raid1 is two, the extent mapping will manage put related two devices of this stripe. Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
On 30.01.2018 08:30, Anand Jain wrote: > Adds the mount option: > mount -o read_mirror_policy= > > To set the devid of the device which should be used for read. That > means all the normal reads will go to that particular device only. > > This also helps testing and gives a better control for the test > scripts including mount context reads. Some code comments below. OTOH, does such policy really make sense, what happens if the selected device fails, will the other mirror be retried? If the answer to the previous question is positive then why do we really care which device is going to be tried first? > > Signed-off-by: Anand Jain> --- > fs/btrfs/super.c | 21 + > fs/btrfs/volumes.c | 10 ++ > fs/btrfs/volumes.h | 2 ++ > 3 files changed, 33 insertions(+) > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index dfe6b3c67df3..d3aad87e 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -847,6 +847,27 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char > *options, > BTRFS_READ_MIRROR_BY_PID; > break; > } > + > + intarg = 0; > + if (match_int([0], ) == 0) { > + struct btrfs_device *device; > + > + device = btrfs_find_device(info, intarg, > +NULL, NULL); > + if (!device) { > + btrfs_err(info, > + "read_mirror_policy: invalid devid > %d", > + intarg); > + ret = -EINVAL; > + goto out; > + } > + info->read_mirror_policy = > + BTRFS_READ_MIRROR_BY_DEV; > + set_bit(BTRFS_DEV_STATE_READ_MIRROR, > + >dev_state); > + break; > + } > + > ret = -EINVAL; > goto out; > case Opt_err: > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 39ba59832f38..478623e6e074 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info > *fs_info, > num = map->num_stripes; > > switch(fs_info->read_mirror_policy) { > + case BTRFS_READ_MIRROR_BY_DEV: > + optimal = first; > + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, > + >stripes[optimal].dev->dev_state)) > + break; > + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, > + >stripes[++optimal].dev->dev_state)) > + break; > + optimal = first; you set optimal 2 times, the second one seems redundant. Alongside this patch it makes sense to also send a patch to btrfs(5) man page describing the mount option + description of each implemented allocation policy. Another thing which I don't see here is how you are handling the case when you have more than 2 devices in the RAID1 case. As it stands currently you assume there are two devices and first test device 0 and then device 1 and completely ignore any other devices. > + break; > case BTRFS_READ_MIRROR_DEFAULT: > case BTRFS_READ_MIRROR_BY_PID: > default: > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h > index 78f35d299a61..7281f55dea05 100644 > --- a/fs/btrfs/volumes.h > +++ b/fs/btrfs/volumes.h > @@ -50,6 +50,7 @@ struct btrfs_pending_bios { > enum btrfs_read_mirror_type { > BTRFS_READ_MIRROR_DEFAULT, > BTRFS_READ_MIRROR_BY_PID, > + BTRFS_READ_MIRROR_BY_DEV, > }; > > #define BTRFS_DEV_STATE_WRITEABLE(0) > @@ -57,6 +58,7 @@ enum btrfs_read_mirror_type { > #define BTRFS_DEV_STATE_MISSING (2) > #define BTRFS_DEV_STATE_REPLACE_TGT (3) > #define BTRFS_DEV_STATE_FLUSH_SENT (4) > +#define BTRFS_DEV_STATE_READ_MIRROR (5) > > struct btrfs_device { > struct list_head dev_list; > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] btrfs: add read_mirror_policy parameter devid
Adds the mount option: mount -o read_mirror_policy= To set the devid of the device which should be used for read. That means all the normal reads will go to that particular device only. This also helps testing and gives a better control for the test scripts including mount context reads. Signed-off-by: Anand Jain--- fs/btrfs/super.c | 21 + fs/btrfs/volumes.c | 10 ++ fs/btrfs/volumes.h | 2 ++ 3 files changed, 33 insertions(+) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index dfe6b3c67df3..d3aad87e 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -847,6 +847,27 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, BTRFS_READ_MIRROR_BY_PID; break; } + + intarg = 0; + if (match_int([0], ) == 0) { + struct btrfs_device *device; + + device = btrfs_find_device(info, intarg, + NULL, NULL); + if (!device) { + btrfs_err(info, + "read_mirror_policy: invalid devid %d", + intarg); + ret = -EINVAL; + goto out; + } + info->read_mirror_policy = + BTRFS_READ_MIRROR_BY_DEV; + set_bit(BTRFS_DEV_STATE_READ_MIRROR, + >dev_state); + break; + } + ret = -EINVAL; goto out; case Opt_err: diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 39ba59832f38..478623e6e074 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info, num = map->num_stripes; switch(fs_info->read_mirror_policy) { + case BTRFS_READ_MIRROR_BY_DEV: + optimal = first; + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, +>stripes[optimal].dev->dev_state)) + break; + if (test_bit(BTRFS_DEV_STATE_READ_MIRROR, +>stripes[++optimal].dev->dev_state)) + break; + optimal = first; + break; case BTRFS_READ_MIRROR_DEFAULT: case BTRFS_READ_MIRROR_BY_PID: default: diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 78f35d299a61..7281f55dea05 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -50,6 +50,7 @@ struct btrfs_pending_bios { enum btrfs_read_mirror_type { BTRFS_READ_MIRROR_DEFAULT, BTRFS_READ_MIRROR_BY_PID, + BTRFS_READ_MIRROR_BY_DEV, }; #define BTRFS_DEV_STATE_WRITEABLE (0) @@ -57,6 +58,7 @@ enum btrfs_read_mirror_type { #define BTRFS_DEV_STATE_MISSING(2) #define BTRFS_DEV_STATE_REPLACE_TGT(3) #define BTRFS_DEV_STATE_FLUSH_SENT (4) +#define BTRFS_DEV_STATE_READ_MIRROR(5) struct btrfs_device { struct list_head dev_list; -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html