Re: [PATCH V2 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-29 Thread Anand Jain



On 03/29/2017 12:19 AM, David Sterba wrote:

On Tue, Mar 14, 2017 at 04:26:11PM +0800, Anand Jain wrote:

The objective of this patch is to cleanup barrier_all_devices()
so that the error checking is in a separate loop independent of
of the loop which submits and waits on the device flush requests.


I think that getting completely rid of the ENOMEM as suggested under
patch 2, the split would not be necessary.

The send failures will be gone so we can simply count failures from the
waiting write_dev_flush(..., 1). There any returned error also implies
the device stat increase for FLUSH_ERRS. Then barrier_all_devices will
be a bit simpler.


 However we need per device flush error, so we can deduce if
 the volume is degrade-mountable [1]
 (and so the temporary shelter for the per-device error was necessary).

 [1] btrfs: Introduce a function to check if all chunks a OK for 
degraded rw mount




By doing this it helps to further develop patches which would tune
the error-actions as needed.

Here functions such as btrfs_dev_stats_dirty() couldn't be used
because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS.

Signed-off-by: Anand Jain 
---
v2: Address Qu review comments viz..
 Add meaningful names, like cp_list (for checkpoint_list head).
 (And actually it does not need a new struct type just to hold
  the head pointer, list node is already named as device_checkpoint).
 Check return value of add_device_checkpoint()
 Check if the device is already added at add_device_checkpoint()
 Rename fini_devices_checkpoint() to rel_devices_checkpoint()

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5719e036048b..d0c401884643 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3566,6 +3566,86 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
return 0;
 }

+struct device_checkpoint {
+   struct list_head cp_node;
+   struct btrfs_device *device;
+   int stat_value_checkpoint;
+};


I find this approach particularly bad for several reasons.

You need to store just one int per device but introduce a bloated
temporary structure. If anything, the int could be embeded in
btrfs_device, set and checked at the right points. Instead, you add
memory allocations to a critical context and more potential failure
paths.


 Right. As of now we are using dev_stats_ccnt to flag if the dev_stat
 contents have changed in general, however its not specific to one of
 the 5 error in which flush error is one among them. Here we are
 tracking only the flush errors occurred in this-commit do-barrier
 request.

 I think one other idea is to track flush error similar to
 dev_stats_ccnt but limited only to flush error. However in the long
 term we may end up similarly tracking the other errors too,
 and I was bit concerned about that, so limited the changes to local.
 But now I think its good to try if you are ok, Unless there is any
 other better way ?

 Thanks for pointing out other misses as below, the above new plan
 will fix them as well.

Thanks, Anand



+static int add_device_checkpoint(struct btrfs_device *device,
+   struct list_head *cp_list)
+{

...

+   cdev = kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL);

...

+}


The memory allocations are GFP_KERNEL, this could lead to strange
lockups if the allocator tries to free memory and asks the filesystem(s)
to flush their data.

Traversing the structures leads to quadratic complexity in
check_barrier_error(), where the checkpoint list is traversed for
each iteration of device list.




@@ -3574,8 +3654,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int dropouts = 0;
int ret;
+   static LIST_HEAD(cp_list);

^^

What is this supposed to mean? What if several filesystems end up in
barrier_all_devices modifying the list?


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-28 Thread David Sterba
On Tue, Mar 14, 2017 at 04:26:11PM +0800, Anand Jain wrote:
> The objective of this patch is to cleanup barrier_all_devices()
> so that the error checking is in a separate loop independent of
> of the loop which submits and waits on the device flush requests.

I think that getting completely rid of the ENOMEM as suggested under
patch 2, the split would not be necessary.

The send failures will be gone so we can simply count failures from the
waiting write_dev_flush(..., 1). There any returned error also implies
the device stat increase for FLUSH_ERRS. Then barrier_all_devices will
be a bit simpler.

> By doing this it helps to further develop patches which would tune
> the error-actions as needed.
> 
> Here functions such as btrfs_dev_stats_dirty() couldn't be used
> because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS.
> 
> Signed-off-by: Anand Jain 
> ---
> v2: Address Qu review comments viz..
>  Add meaningful names, like cp_list (for checkpoint_list head).
>  (And actually it does not need a new struct type just to hold
>   the head pointer, list node is already named as device_checkpoint).
>  Check return value of add_device_checkpoint()
>  Check if the device is already added at add_device_checkpoint()
>  Rename fini_devices_checkpoint() to rel_devices_checkpoint()
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5719e036048b..d0c401884643 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3566,6 +3566,86 @@ static int write_dev_flush(struct btrfs_device 
> *device, int wait)
>   return 0;
>  }
>  
> +struct device_checkpoint {
> + struct list_head cp_node;
> + struct btrfs_device *device;
> + int stat_value_checkpoint;
> +};

I find this approach particularly bad for several reasons.

You need to store just one int per device but introduce a bloated
temporary structure. If anything, the int could be embeded in
btrfs_device, set and checked at the right points. Instead, you add
memory allocations to a critical context and more potential failure
paths.

> +static int add_device_checkpoint(struct btrfs_device *device,
> + struct list_head *cp_list)
> +{
...
> + cdev = kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL);
...
> +}

The memory allocations are GFP_KERNEL, this could lead to strange
lockups if the allocator tries to free memory and asks the filesystem(s)
to flush their data.

Traversing the structures leads to quadratic complexity in
check_barrier_error(), where the checkpoint list is traversed for
each iteration of device list.

> @@ -3574,8 +3654,8 @@ static int barrier_all_devices(struct btrfs_fs_info 
> *info)
>  {
>   struct list_head *head;
>   struct btrfs_device *dev;
> - int dropouts = 0;
>   int ret;
> + static LIST_HEAD(cp_list);
^^

What is this supposed to mean? What if several filesystems end up in
barrier_all_devices modifying the list?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-14 Thread Qu Wenruo



At 03/14/2017 04:26 PM, Anand Jain wrote:

The objective of this patch is to cleanup barrier_all_devices()
so that the error checking is in a separate loop independent of
of the loop which submits and waits on the device flush requests.

By doing this it helps to further develop patches which would tune
the error-actions as needed.

Here functions such as btrfs_dev_stats_dirty() couldn't be used
because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS.

Signed-off-by: Anand Jain 
---
v2: Address Qu review comments viz..
 Add meaningful names, like cp_list (for checkpoint_list head).
 (And actually it does not need a new struct type just to hold
  the head pointer, list node is already named as device_checkpoint).
 Check return value of add_device_checkpoint()
 Check if the device is already added at add_device_checkpoint()
 Rename fini_devices_checkpoint() to rel_devices_checkpoint()

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5719e036048b..d0c401884643 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3566,6 +3566,86 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
return 0;
 }

+struct device_checkpoint {
+   struct list_head cp_node;
+   struct btrfs_device *device;
+   int stat_value_checkpoint;
+};
+
+static int add_device_checkpoint(struct btrfs_device *device,
+   struct list_head *cp_list)
+{
+   struct device_checkpoint *cdev;
+
+   list_for_each_entry(cdev, cp_list, cp_node) {
+   if (cdev->device == device) {
+   cdev->stat_value_checkpoint =
+   btrfs_dev_stat_read(device,
+   BTRFS_DEV_STAT_FLUSH_ERRS);
+   return 0;
+   }
+   }
+
+   cdev = kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL);
+   if (!cdev)
+   return -ENOMEM;
+
+   list_add(>cp_node, cp_list);
+
+   cdev->device = device;
+   cdev->stat_value_checkpoint =
+   btrfs_dev_stat_read(device, BTRFS_DEV_STAT_FLUSH_ERRS);
+
+   return 0;
+}
+
+static void rel_devices_checkpoint(struct list_head *cp_list)


I meant "release_devices_checkpoint", 'fini' and 'rel' is too short for 
to be meaningful.


Despite of that, looks not too bad to me.

Although I prefer to do extra locking to protect the operation to make 
the checkpoint system more independent and robust, but it can be done later.


So far so good. Although I'm still not sure if other reviewers will have 
other comments.


Thanks,
Qu

+{
+   struct device_checkpoint *cdev;
+
+   while(!list_empty(cp_list)) {
+   cdev = list_entry(cp_list->next,
+   struct device_checkpoint, cp_node);
+   list_del(>cp_node);
+   kfree(cdev);
+   }
+}
+
+static int check_stat_flush(struct btrfs_device *dev,
+   struct list_head *cp_list)
+{
+   int val;
+   struct device_checkpoint *cdev;
+
+   list_for_each_entry(cdev, cp_list, cp_node) {
+   if (cdev->device == dev) {
+   val = btrfs_dev_stat_read(dev,
+   BTRFS_DEV_STAT_FLUSH_ERRS);
+   if (cdev->stat_value_checkpoint != val)
+   return 1;
+   }
+   }
+   return 0;
+}
+
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs,
+   struct list_head *cp_list)
+{
+   int dropouts = 0;
+   struct btrfs_device *dev;
+
+   list_for_each_entry_rcu(dev, >devices, dev_list) {
+   if (!dev->bdev || check_stat_flush(dev, cp_list))
+   dropouts++;
+   }
+
+   if (dropouts >
+   fsdevs->fs_info->num_tolerated_disk_barrier_failures)
+   return -EIO;
+
+   return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3574,8 +3654,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int dropouts = 0;
int ret;
+   static LIST_HEAD(cp_list);

/* send down all the barriers */
head = >fs_devices->devices;
@@ -3587,29 +3667,35 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
if (!dev->in_fs_metadata || !dev->writeable)
continue;

+   ret = add_device_checkpoint(dev, _list);
+   if (ret) {
+   rel_devices_checkpoint(_list);
+   return ret;
+   }
ret = write_dev_flush(dev, 0);
-   if (ret)
+   if (ret) {
+   rel_devices_checkpoint(_list);
return ret;
+   }
}

/* wait for all the 

[PATCH V2 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-14 Thread Anand Jain
The objective of this patch is to cleanup barrier_all_devices()
so that the error checking is in a separate loop independent of
of the loop which submits and waits on the device flush requests.

By doing this it helps to further develop patches which would tune
the error-actions as needed.

Here functions such as btrfs_dev_stats_dirty() couldn't be used
because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS.

Signed-off-by: Anand Jain 
---
v2: Address Qu review comments viz..
 Add meaningful names, like cp_list (for checkpoint_list head).
 (And actually it does not need a new struct type just to hold
  the head pointer, list node is already named as device_checkpoint).
 Check return value of add_device_checkpoint()
 Check if the device is already added at add_device_checkpoint()
 Rename fini_devices_checkpoint() to rel_devices_checkpoint()

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5719e036048b..d0c401884643 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3566,6 +3566,86 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
return 0;
 }
 
+struct device_checkpoint {
+   struct list_head cp_node;
+   struct btrfs_device *device;
+   int stat_value_checkpoint;
+};
+
+static int add_device_checkpoint(struct btrfs_device *device,
+   struct list_head *cp_list)
+{
+   struct device_checkpoint *cdev;
+
+   list_for_each_entry(cdev, cp_list, cp_node) {
+   if (cdev->device == device) {
+   cdev->stat_value_checkpoint =
+   btrfs_dev_stat_read(device,
+   BTRFS_DEV_STAT_FLUSH_ERRS);
+   return 0;
+   }
+   }
+
+   cdev = kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL);
+   if (!cdev)
+   return -ENOMEM;
+
+   list_add(>cp_node, cp_list);
+
+   cdev->device = device;
+   cdev->stat_value_checkpoint =
+   btrfs_dev_stat_read(device, BTRFS_DEV_STAT_FLUSH_ERRS);
+
+   return 0;
+}
+
+static void rel_devices_checkpoint(struct list_head *cp_list)
+{
+   struct device_checkpoint *cdev;
+
+   while(!list_empty(cp_list)) {
+   cdev = list_entry(cp_list->next,
+   struct device_checkpoint, cp_node);
+   list_del(>cp_node);
+   kfree(cdev);
+   }
+}
+
+static int check_stat_flush(struct btrfs_device *dev,
+   struct list_head *cp_list)
+{
+   int val;
+   struct device_checkpoint *cdev;
+
+   list_for_each_entry(cdev, cp_list, cp_node) {
+   if (cdev->device == dev) {
+   val = btrfs_dev_stat_read(dev,
+   BTRFS_DEV_STAT_FLUSH_ERRS);
+   if (cdev->stat_value_checkpoint != val)
+   return 1;
+   }
+   }
+   return 0;
+}
+
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs,
+   struct list_head *cp_list)
+{
+   int dropouts = 0;
+   struct btrfs_device *dev;
+
+   list_for_each_entry_rcu(dev, >devices, dev_list) {
+   if (!dev->bdev || check_stat_flush(dev, cp_list))
+   dropouts++;
+   }
+
+   if (dropouts >
+   fsdevs->fs_info->num_tolerated_disk_barrier_failures)
+   return -EIO;
+
+   return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3574,8 +3654,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int dropouts = 0;
int ret;
+   static LIST_HEAD(cp_list);
 
/* send down all the barriers */
head = >fs_devices->devices;
@@ -3587,29 +3667,35 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
+   ret = add_device_checkpoint(dev, _list);
+   if (ret) {
+   rel_devices_checkpoint(_list);
+   return ret;
+   }
ret = write_dev_flush(dev, 0);
-   if (ret)
+   if (ret) {
+   rel_devices_checkpoint(_list);
return ret;
+   }
}
 
/* wait for all the barriers */
list_for_each_entry_rcu(dev, head, dev_list) {
if (dev->missing)
continue;
-   if (!dev->bdev) {
-   dropouts++;
+   if (!dev->bdev)
continue;
-   }
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
-   ret = write_dev_flush(dev, 1);
-