from:"Akinobu Mita"

Re: [PATCH] fault_inject: Replace DEFINE_SIMPLE_ATTRIBUTE with DEFINE_DEBUGFS_ATTRIBUTE

2021-02-02 Thread Akinobu Mita

On Mon, Feb 1, 2021 at 16:43, Jiapeng Chong wrote:
>
> Fix the following coccicheck warning:
>
> ./lib/fault-inject.c:187:0-23: WARNING: fops_stacktrace_depth should be
> defined with DEFINE_DEBUGFS_ATTRIBUTE.
>
> ./lib/fault-inject.c:169:0-23: WARNING: fops_ul should be defined with
> DEFINE_DEBUGFS_ATTRIBUTE.
>
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 
> ---
>  lib/fault-inject.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/lib/fault-inject.c b/lib/fault-inject.c
> index ce12621..cb7ea22 100644
> --- a/lib/fault-inject.c
> +++ b/lib/fault-inject.c
> @@ -166,7 +166,7 @@ static int debugfs_ul_get(void *data, u64 *val)
> return 0;
>  }
>
> -DEFINE_SIMPLE_ATTRIBUTE(fops_ul, debugfs_ul_get, debugfs_ul_set, "%llu\n");
> +DEFINE_DEBUGFS_ATTRIBUTE(fops_ul, debugfs_ul_get, debugfs_ul_set, "%llu\n");
>
>  static void debugfs_create_ul(const char *name, umode_t mode,
>   struct dentry *parent, unsigned long *value)

Could you just remove this fops_ul stuff and use debugfs_create_ulong() instead?

Re: [RFC PATCH v2 2/2] docs: add fail_lsm_hooks info to fault-injection.rst

2020-10-27 Thread Akinobu Mita

On Mon, Oct 26, 2020 at 21:52, Aleksandr Nogikh wrote:
>
> From: Aleksandr Nogikh 
>
> Describe fail_lsm_hooks fault injection capability.
>
> Signed-off-by: Aleksandr Nogikh 
> ---
> v2:
> - Added this commit.
> ---
>  Documentation/fault-injection/fault-injection.rst | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/Documentation/fault-injection/fault-injection.rst 
> b/Documentation/fault-injection/fault-injection.rst
> index 31ecfe44e5b4..48705adfbc18 100644
> --- a/Documentation/fault-injection/fault-injection.rst
> +++ b/Documentation/fault-injection/fault-injection.rst
> @@ -48,6 +48,12 @@ Available fault injection capabilities
>status code is NVME_SC_INVALID_OPCODE with no retry. The status code and
>retry flag can be set via the debugfs.
>
> +- fail_lsm_hooks
> +
> +  injects failures into LSM hooks. When a fault is injected, actual hooks
> +  are not executed and a code from /sys/kernel/debug/fail_lsm_hooks/retval
> +  is returned (the default value is -EACCES).

In addition to this global one, what do you think about per-hook fault
injection, i.e. /sys/kernel/debug/fail_lsm_hooks//retval ?

In this case, we need a fault_attr for each hook. (Maybe we can use the same
technique that is used to define security_hook_heads.)
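
A minimal user-space sketch of that idea, assuming an X-macro style hook list:
one list macro is expanded twice, once into an enum of hook IDs and once into
a table holding one fault attribute per hook. Every name below
(DEMO_HOOK_LIST, demo_fault_attr, demo_call_hook, the three hook names) is
hypothetical and only illustrates the pattern. It is not the kernel's
fault_attr API or the real LSM hook list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hook list; each entry is expanded by whatever X is defined as. */
#define DEMO_HOOK_LIST(X) \
	X(file_open)      \
	X(inode_create)   \
	X(task_kill)

/* Simplified stand-in for a per-hook fault attribute. */
struct demo_fault_attr {
	const char *name;	/* would name the per-hook debugfs directory */
	int retval;		/* value returned when a fault is injected */
	int enabled;		/* nonzero: inject instead of calling the hook */
};

/* First expansion: one enum constant per hook. */
enum demo_hook_id {
#define X(hook) DEMO_HOOK_##hook,
	DEMO_HOOK_LIST(X)
#undef X
	DEMO_HOOK_COUNT
};

/* Second expansion: one attribute per hook, defaulting to -EACCES. */
static struct demo_fault_attr demo_hook_attrs[DEMO_HOOK_COUNT] = {
#define X(hook) [DEMO_HOOK_##hook] = { #hook, -EACCES, 0 },
	DEMO_HOOK_LIST(X)
#undef X
};

/* Look up a hook's attribute by name, as a per-hook directory would. */
static struct demo_fault_attr *demo_find_hook(const char *name)
{
	size_t i;

	for (i = 0; i < DEMO_HOOK_COUNT; i++)
		if (strcmp(demo_hook_attrs[i].name, name) == 0)
			return &demo_hook_attrs[i];
	return NULL;
}

/* Skip the real hook and return the configured value when injection is on. */
static int demo_call_hook(enum demo_hook_id id, int (*real_hook)(void))
{
	assert(id < DEMO_HOOK_COUNT);
	if (demo_hook_attrs[id].enabled)
		return demo_hook_attrs[id].retval;
	return real_hook();
}

/* Trivial "real" hook used for demonstration: always allows. */
static int demo_allow(void)
{
	return 0;
}
```

Adding a hook to DEMO_HOOK_LIST extends both the enum and the table, so the
two cannot drift apart. Enabling injection for one hook leaves the others
untouched, which is the point of the per-hook variant.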

Re: [PATCH] fault-injection: handle EI_ETYPE_TRUE

2020-10-14 Thread Akinobu Mita

Hi Andrew,

Please consider taking this patch in the -mm tree.

This patch looks good to me.

Reviewed-by: Akinobu Mita 

On Tue, Oct 13, 2020 at 18:31, Barnabás Pőcze wrote:
>
> Hi,
>
> I had some difficulty finding who should receive this patch, and I am not
> sure I got it right. Could someone please confirm that any of you
> can take this patch, or should I resend it? (In that case, to whom?)
>
>
> Thank you,
> Barnabás Pőcze
>
>
> > Commit af3b854492f351d1ff3b4744a83bf5ff7eed4920
> > ("mm/page_alloc.c: allow error injection")
> > introduced EI_ETYPE_TRUE, but did not extend
> >
> > -   lib/error-inject.c:error_type_string(), and
> > -   kernel/fail_function.c:adjust_error_retval()
> > to accommodate for this change.
> >
> > Handle EI_ETYPE_TRUE in both functions appropriately by
> >
> > -   returning "TRUE" in error_type_string(),
> > -   adjusting the return value to true (1) in adjust_error_retval().
> >
> > Furthermore, simplify the logic of handling EI_ETYPE_NULL
> > in adjust_error_retval().
> >
> > Signed-off-by: Barnabás Pőcze po...@protonmail.com
> >
> >
> > kernel/fail_function.c | 6 +++---
> > lib/error-inject.c | 2 ++
> > 2 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/fail_function.c b/kernel/fail_function.c
> > index 63b349168da7..4fdea01c0561 100644
> > --- a/kernel/fail_function.c
> > +++ b/kernel/fail_function.c
> > @@ -37,9 +37,7 @@ static unsigned long adjust_error_retval(unsigned long addr, unsigned long retv)
> >  {
> >  	switch (get_injectable_error_type(addr)) {
> >  	case EI_ETYPE_NULL:
> > -		if (retv != 0)
> > -			return 0;
> > -		break;
> > +		return 0;
> >  	case EI_ETYPE_ERRNO:
> >  		if (retv < (unsigned long)-MAX_ERRNO)
> >  			return (unsigned long)-EINVAL;
> > @@ -48,6 +46,8 @@ static unsigned long adjust_error_retval(unsigned long addr, unsigned long retv)
> >  		if (retv != 0 && retv < (unsigned long)-MAX_ERRNO)
> >  			return (unsigned long)-EINVAL;
> >  		break;
> > +	case EI_ETYPE_TRUE:
> > +		return 1;
> >  	}
> >
> >  	return retv;
> > diff --git a/lib/error-inject.c b/lib/error-inject.c
> > index aa63751c916f..c73651b15b76 100644
> > --- a/lib/error-inject.c
> > +++ b/lib/error-inject.c
> > @@ -180,6 +180,8 @@ static const char *error_type_string(int etype)
> >  		return "ERRNO";
> >  	case EI_ETYPE_ERRNO_NULL:
> >  		return "ERRNO_NULL";
> > +	case EI_ETYPE_TRUE:
> > +		return "TRUE";
> >  	default:
> >  		return "(unknown)";
> >  	}
> > --
> > 2.28.0
> >
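
The return-value adjustment described in the quoted commit message can be
sketched in user space as follows. This is an illustration only: the
DEMO_ETYPE_* enum and DEMO_MAX_ERRNO are stand-ins for the kernel's
EI_ETYPE_* values and MAX_ERRNO, and the function takes the error type
directly instead of looking it up from an address.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the kernel's MAX_ERRNO (largest errno encodable in a pointer). */
#define DEMO_MAX_ERRNO 4095

/* Stand-ins for the EI_ETYPE_* error-injection types. */
enum demo_etype {
	DEMO_ETYPE_NULL,	/* function signals failure by returning NULL */
	DEMO_ETYPE_ERRNO,	/* function returns a negative errno */
	DEMO_ETYPE_ERRNO_NULL,	/* function returns a negative errno or NULL */
	DEMO_ETYPE_TRUE,	/* function signals failure by returning true */
};

/* Clamp a requested return value to one that is valid for the error type. */
static unsigned long demo_adjust_retval(enum demo_etype type, unsigned long retv)
{
	assert(type <= DEMO_ETYPE_TRUE);

	switch (type) {
	case DEMO_ETYPE_NULL:
		return 0;
	case DEMO_ETYPE_ERRNO:
		/* Anything outside the errno range becomes -EINVAL. */
		if (retv < (unsigned long)-DEMO_MAX_ERRNO)
			return (unsigned long)-EINVAL;
		break;
	case DEMO_ETYPE_ERRNO_NULL:
		/* Zero (NULL) is allowed here in addition to the errno range. */
		if (retv != 0 && retv < (unsigned long)-DEMO_MAX_ERRNO)
			return (unsigned long)-EINVAL;
		break;
	case DEMO_ETYPE_TRUE:
		return 1;
	}

	return retv;
}
```

For example, demo_adjust_retval(DEMO_ETYPE_TRUE, 0) yields 1 and
demo_adjust_retval(DEMO_ETYPE_NULL, 12345) yields 0, regardless of the value
the user asked for.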

Re: [PATCH v2 0/3] add fault injection to user memory access

2020-08-31 Thread Akinobu Mita

Andrew,

Could you take a look at this series and consider taking it into the -mm tree?

On Tue, Sep 1, 2020 at 0:49, Alexander Potapenko wrote:
>
> > This series looks good to me.
>
> Great!
>
> Which tree do fault injection patches normally go to?
>
> > Reviewed-by: Akinobu Mita 
>
> Reviewed-by: Alexander Potapenko

Re: [PATCH v2 0/3] add fault injection to user memory access

2020-08-31 Thread Akinobu Mita

On Fri, Aug 28, 2020 at 23:14:
>
> From: Albert van der Linde 
>
> The goal of this series is to improve testing of fault-tolerance in
> usages of user memory access functions, by adding support for fault
> injection.
>
> The first patch adds failure injection capability for usercopy
> functions. The second changes usercopy functions to use this new failure
> capability (copy_from_user, ...). The third patch adds
> get/put/clear_user failures to x86.

This series looks good to me.

Reviewed-by: Akinobu Mita

[PATCH -next 1/2] leds: add /sys/devices/virtual/led-trigger/

2019-10-02 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, this violates the "one value per file" rule of sysfs.

This makes led_triggers "real" devices and provides a
/sys/devices/virtual/led-trigger/ directory that contains a sub-directory
for each LED trigger device. The name of the sub-directory matches the LED
trigger name.

We can find all available LED triggers by listing the contents of this
directory.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Signed-off-by: Akinobu Mita 
---
 .../ABI/testing/sysfs-devices-virtual-led-trigger  |  8 +++
 drivers/leds/led-triggers.c| 57 ++
 include/linux/leds.h   |  3 ++
 3 files changed, 68 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-virtual-led-trigger

diff --git a/Documentation/ABI/testing/sysfs-devices-virtual-led-trigger 
b/Documentation/ABI/testing/sysfs-devices-virtual-led-trigger
new file mode 100644
index 000..b8eb8f3
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-virtual-led-trigger
@@ -0,0 +1,8 @@
+What:  /sys/devices/virtual/led-trigger/
+Date:  September 2019
+KernelVersion: 5.5
+Contact:   linux-l...@vger.kernel.org
+Description:
+   This directory contains a sub-directory for each LED trigger
+   device. The name of the sub-directory matches the LED trigger
+   name.
diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 79e30d2..0b810cf 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -267,21 +267,76 @@ void led_trigger_rename_static(const char *name, struct 
led_trigger *trig)
 }
 EXPORT_SYMBOL_GPL(led_trigger_rename_static);
 
+struct ledtrig_device {
+   struct device dev;
+};
+
+static void ledtrig_device_release(struct device *dev)
+{
+   struct ledtrig_device *trig_dev =
+   container_of(dev, struct ledtrig_device, dev);
+
+   kfree(trig_dev);
+}
+
+static struct bus_type led_trigger_subsys = {
+   .name = "led-trigger",
+};
+
+static int led_trigger_subsys_init(void)
+{
+   static DEFINE_MUTEX(init_mutex);
+   static bool init_done;
+   int ret = 0;
+
+   mutex_lock(&init_mutex);
+   if (!init_done) {
+   ret = subsys_virtual_register(&led_trigger_subsys, NULL);
+   if (!ret)
+   init_done = true;
+   }
+   mutex_unlock(&init_mutex);
+
+   return ret;
+}
+
 /* LED Trigger Interface */
 
 int led_trigger_register(struct led_trigger *trig)
 {
struct led_classdev *led_cdev;
struct led_trigger *_trig;
+   struct ledtrig_device *trig_dev;
+   int ret;
 
rwlock_init(&trig->leddev_list_lock);
INIT_LIST_HEAD(&trig->led_cdevs);
 
+   ret = led_trigger_subsys_init();
+   if (ret)
+   return ret;
+   trig_dev = kzalloc(sizeof(*trig_dev), GFP_KERNEL);
+   if (!trig_dev)
+   return -ENOMEM;
+
+   trig_dev->dev.bus = &led_trigger_subsys;
+   trig_dev->dev.release = ledtrig_device_release;
+   dev_set_name(&trig_dev->dev, "%s", trig->name);
+
+   ret = device_register(&trig_dev->dev);
+   if (ret) {
+   put_device(&trig_dev->dev);
+   return ret;
+   }
+
+   trig->trig_dev = trig_dev;
+
down_write(&triggers_list_lock);
/* Make sure the trigger's name isn't already in use */
list_for_each_entry(_trig, &trigger_list, next_trig) {
if (!strcmp(_trig->name, trig->name)) {
up_write(&triggers_list_lock);
+   device_unregister(&trig_dev->dev);
return -EEXIST;
}
}
@@ -327,6 +382,8 @@ void led_trigger_unregister(struct led_trigger *trig)
up_write(&led_cdev->trigger_lock);
}
up_read(&triggers_list_lock);
+
+   device_unregister(&trig->trig_dev->dev);
 }
 EXPORT_SYMBOL_GPL(led_trigger_unregister);
 
diff --git a/include/linux/leds.h b/include/linux/leds.h
index da78b27..d63c8e7 100644
--- a/include/linux/leds.h
+++ b/include/linux/leds.h
@@ -336,6 +336,8 @@ static inline bool led_sysfs_is_disabled(struct 
led_classdev *led_cdev)
 
 #define TRIG_NAME_MAX 50
 
+struct ledtrig_device;
+
 struct led_trigger {
/* Trigger Properties */
const char   *name;
@@ -350,6 +352,7 @@ struct led_trigger {
struct list_head  next_trig;
 
const struct attribute_group **groups;
+   struct ledtrig_device *trig_dev;
 };
 
 /*
-- 
2.7.4

[PATCH -next 2/2] leds: add /sys/class/leds//current-trigger

2019-10-02 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, this violates the "one value per file" rule of sysfs.

This provides /sys/class/leds//current-trigger which is almost
identical to /sys/class/leds//trigger.  The only difference is that
'current-trigger' only shows the current trigger name.

This new file follows the "one value per file" rule of sysfs.
We can find all available LED triggers by listing the
/sys/devices/virtual/led-trigger/ directory.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Signed-off-by: Akinobu Mita 
---
 Documentation/ABI/testing/sysfs-class-led | 13 +++
 drivers/leds/led-class.c  | 10 
 drivers/leds/led-triggers.c   | 38 +++
 drivers/leds/leds.h   |  5 
 4 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-led 
b/Documentation/ABI/testing/sysfs-class-led
index 5f67f7a..fdfed3f 100644
--- a/Documentation/ABI/testing/sysfs-class-led
+++ b/Documentation/ABI/testing/sysfs-class-led
@@ -61,3 +61,16 @@ Description:
gpio and backlight triggers. In case of the backlight trigger,
it is useful when driving a LED which is intended to indicate
a device in a standby like state.
+
+What:  /sys/class/leds//current-trigger
+Date:  September 2019
+KernelVersion: 5.5
+Contact:   linux-l...@vger.kernel.org
+Description:
+   Set the trigger for this LED. A trigger is a kernel based source
+   of LED events.
+   Writing the trigger name to this file will change the current
+   trigger. Trigger specific parameters can appear in
+   /sys/class/leds/ once a given trigger is selected. For
+   their documentation see sysfs-class-led-trigger-*.
+   Reading this file will return the current LED trigger name.
diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
index 3f04334..3cb0d8a 100644
--- a/drivers/leds/led-class.c
+++ b/drivers/leds/led-class.c
@@ -74,12 +74,22 @@ static ssize_t max_brightness_show(struct device *dev,
 static DEVICE_ATTR_RO(max_brightness);
 
 #ifdef CONFIG_LEDS_TRIGGERS
+
+static DEVICE_ATTR(current_trigger, 0644, led_current_trigger_show,
+  led_current_trigger_store);
+
+static struct attribute *led_current_trigger_attrs[] = {
+   &dev_attr_current_trigger.attr,
+   NULL,
+};
+
 static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
 static struct bin_attribute *led_trigger_bin_attrs[] = {
&bin_attr_trigger,
NULL,
 };
 static const struct attribute_group led_trigger_group = {
+   .attrs = led_current_trigger_attrs,
.bin_attrs = led_trigger_bin_attrs,
 };
 #endif
diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 0b810cf..a2ef674 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -27,11 +27,9 @@ LIST_HEAD(trigger_list);
 
  /* Used by LED Class */
 
-ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
- struct bin_attribute *bin_attr, char *buf,
- loff_t pos, size_t count)
+static ssize_t led_trigger_store(struct device *dev, const char *buf,
+size_t count)
 {
-   struct device *dev = kobj_to_dev(kobj);
struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
int ret = count;
@@ -67,8 +65,25 @@ ssize_t led_trigger_write(struct file *filp, struct kobject 
*kobj,
mutex_unlock(&led_cdev->led_access);
return ret;
 }
+
+ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t pos, size_t count)
+{
+   struct device *dev = kobj_to_dev(kobj);
+
+   return led_trigger_store(dev, buf, count);
+}
 EXPORT_SYMBOL_GPL(led_trigger_write);
 
+ssize_t led_current_trigger_store(struct device *dev,
+   struct device_attribute *attr, const char *buf,
+   size_t count)
+{
+   return led_trigger_store(dev, buf, count);
+}
+EXPORT_SYMBOL_GPL(led_current_trigger_store);
+
 __printf(3, 4)
 static int led_trigger_snprintf(char *buf, ssize_t size, const char *fmt, ...)
 {
@@ -144,6 +159,21 @@ ssize_t led_trigger_read(struct file *filp, struct kobject 
*kobj,
 }
 EXPORT_SYMBOL_GPL(led_trigger_read);
 
+ssize_t led_current_trigger_show(struct device *dev,
+struct device_attribute *attr, char *buf)
+{
+   struct led_classdev *led_cdev = dev_get_drvdata(dev);
+   int len;
+
+   down_read(&led_cdev->trigger_lock);
+   len = scnprintf(buf, PAGE_SIZE, "%s\n", led_cdev->trigger ?
+

[PATCH -next 0/2] leds: add substitutes for /sys/class/leds//trigger

2019-10-02 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, this violates the "one value per file" rule of sysfs.

This series provides a new /sys/devices/virtual/led-trigger/ directory and
/sys/class/leds//current-trigger. The new api follows the "one value
per file" rule of sysfs.

This series was previously developed as a part of the series "leds: fix
/sys/class/leds//trigger and add new api" [1].  Now this version
only contains the new api part.

[1] 
https://lore.kernel.org/r/1567946472-10075-1-git-send-email-akinobu.m...@gmail.com

Akinobu Mita (2):
  leds: add /sys/devices/virtual/led-trigger/
  leds: add /sys/class/leds//current-trigger

 Documentation/ABI/testing/sysfs-class-led  | 13 +++
 .../ABI/testing/sysfs-devices-virtual-led-trigger  |  8 ++
 drivers/leds/led-class.c   | 10 +++
 drivers/leds/led-triggers.c| 95 +-
 drivers/leds/leds.h|  5 ++
 include/linux/leds.h   |  3 +
 6 files changed, 130 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-virtual-led-trigger

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
-- 
2.7.4

[PATCH v3 1/1] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-29 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, the size of this file is limited to PAGE_SIZE because of the
limitation for sysfs attribute.

Enabling LED CPU trigger on systems with thousands of CPUs easily hits
PAGE_SIZE limit, and makes it impossible to see all available LED triggers
and which trigger is currently activated.

We work around it here by converting /sys/class/leds//trigger to
binary attribute, which is not limited by length. This is _not_ good
design, do not copy it.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Signed-off-by: Akinobu Mita 
---
 drivers/leds/led-class.c|  8 ++--
 drivers/leds/led-triggers.c | 90 ++---
 drivers/leds/leds.h |  6 +++
 include/linux/leds.h|  5 ---
 4 files changed, 78 insertions(+), 31 deletions(-)

diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
index 647b126..3f04334 100644
--- a/drivers/leds/led-class.c
+++ b/drivers/leds/led-class.c
@@ -74,13 +74,13 @@ static ssize_t max_brightness_show(struct device *dev,
 static DEVICE_ATTR_RO(max_brightness);
 
 #ifdef CONFIG_LEDS_TRIGGERS
-static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store);
-static struct attribute *led_trigger_attrs[] = {
-   &dev_attr_trigger.attr,
+static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
+static struct bin_attribute *led_trigger_bin_attrs[] = {
+   &bin_attr_trigger,
NULL,
 };
 static const struct attribute_group led_trigger_group = {
-   .attrs = led_trigger_attrs,
+   .bin_attrs = led_trigger_bin_attrs,
 };
 #endif
 
diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 23963e5c..79e30d2 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "leds.h"
 
 /*
@@ -26,9 +27,11 @@ LIST_HEAD(trigger_list);
 
  /* Used by LED Class */
 
-ssize_t led_trigger_store(struct device *dev, struct device_attribute *attr,
-   const char *buf, size_t count)
+ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t pos, size_t count)
 {
+   struct device *dev = kobj_to_dev(kobj);
struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
int ret = count;
@@ -64,39 +67,82 @@ ssize_t led_trigger_store(struct device *dev, struct 
device_attribute *attr,
mutex_unlock(&led_cdev->led_access);
return ret;
 }
-EXPORT_SYMBOL_GPL(led_trigger_store);
+EXPORT_SYMBOL_GPL(led_trigger_write);
 
-ssize_t led_trigger_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
+__printf(3, 4)
+static int led_trigger_snprintf(char *buf, ssize_t size, const char *fmt, ...)
+{
+   va_list args;
+   int i;
+
+   va_start(args, fmt);
+   if (size <= 0)
+   i = vsnprintf(NULL, 0, fmt, args);
+   else
+   i = vscnprintf(buf, size, fmt, args);
+   va_end(args);
+
+   return i;
+}
+
+static int led_trigger_format(char *buf, size_t size,
+ struct led_classdev *led_cdev)
 {
-   struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
-   int len = 0;
+   int len = led_trigger_snprintf(buf, size, "%s",
+  led_cdev->trigger ? "none" : "[none]");
+
+   list_for_each_entry(trig, &trigger_list, next_trig) {
+   bool hit = led_cdev->trigger &&
+   !strcmp(led_cdev->trigger->name, trig->name);
+
+   len += led_trigger_snprintf(buf + len, size - len,
+   " %s%s%s", hit ? "[" : "",
+   trig->name, hit ? "]" : "");
+   }
+
+   len += led_trigger_snprintf(buf + len, size - len, "\n");
+
+   return len;
+}
+
+/*
+ * It was stupid to create 1 cpu triggers, but we are stuck with it now.
+ * Don't make that mistake again. We work around it here by creating binary
+ * attribute, which is not limited by length. This is _not_ good design, do not
+ * copy it.
+ */
+ssize_t led_trigger_read(struct file *filp, struct kobject *kobj,
+   struct bin_attribute *attr, char *buf,
+   loff_t pos, size_t count)
+{
+   struct device *dev = kobj_to_dev(kobj);
+   struct led_classdev *led_cdev = dev_get_drvdata(dev);
+   void *data;
+   int len;
 
down_read(&triggers_list_lock);
down_read(&led_cdev->trigger_lock);
 
-   if (!led_cdev->trigger)
-   len += scnprintf(buf+len, PAGE_SIZE - len, "[none] ");
-

[PATCH v3 0/1] leds: fix /sys/class/leds//trigger

2019-09-29 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, the size of this file is limited to PAGE_SIZE because of the
limitation for sysfs attribute.

Enabling LED CPU trigger on systems with thousands of CPUs easily hits
PAGE_SIZE limit, and makes it impossible to see all available LED triggers
and which trigger is currently activated.

This patch converts /sys/class/leds//trigger to bin attribute and
removes the PAGE_SIZE limitation.

The first version of this series provided the new api that follows the
"one value per file" rule of sysfs. The second version dropped it because
there have been a number of problems and it turns out that the new api
should be submitted separately.

* v3
- Remove "query" parameters from led_trigger_snprintf() and
  led_trigger_format()
- Return -ENOMEM immediately if memory allocation fails
- Drop Acked-by: tag due to a certain amount of changes

* v2
- Update commit message
- Drop patches for new api

Akinobu Mita (1):
  leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

 drivers/leds/led-class.c|  8 ++--
 drivers/leds/led-triggers.c | 90 ++---
 drivers/leds/leds.h |  6 +++
 include/linux/leds.h|  5 ---
 4 files changed, 78 insertions(+), 31 deletions(-)

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
-- 
2.7.4
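
The measure-then-format approach behind led_trigger_snprintf() and
led_trigger_format() can be sketched in user space as below. This is a
hedged illustration, not the patch itself: the demo_* names and the
malloc()-based caller are assumptions, and since the C library has no
vscnprintf(), its "return the number of bytes actually stored" behaviour is
emulated by clamping the result of vsnprintf().

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Append formatted text. With no room (size <= 0) only measure: vsnprintf()
 * with a NULL buffer and size 0 returns the length the output would have.
 */
static int demo_snprintf(char *buf, long size, const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	if (size <= 0) {
		n = vsnprintf(NULL, 0, fmt, args);
	} else {
		n = vsnprintf(buf, (size_t)size, fmt, args);
		/* Emulate vscnprintf(): report only what was stored. */
		if (n >= size)
			n = (int)size - 1;
	}
	va_end(args);

	return n;
}

/* Format "none a [b] c\n", bracketing the current trigger (or "[none]"). */
static int demo_format(char *buf, long size, const char *const *names,
		       size_t count, const char *current)
{
	int len = demo_snprintf(buf, size, "%s", current ? "none" : "[none]");
	size_t i;

	for (i = 0; i < count; i++) {
		int hit = current && strcmp(current, names[i]) == 0;

		len += demo_snprintf(buf ? buf + len : NULL, size - len,
				     " %s%s%s", hit ? "[" : "",
				     names[i], hit ? "]" : "");
	}
	len += demo_snprintf(buf ? buf + len : NULL, size - len, "\n");

	return len;
}

/* Pass 1 measures with a NULL buffer, pass 2 fills an exact-size buffer. */
static char *demo_format_alloc(const char *const *names, size_t count,
			       const char *current)
{
	int len = demo_format(NULL, 0, names, count, current);
	char *data;

	assert(len >= 0);
	data = malloc((size_t)len + 1);
	if (!data)
		return NULL;	/* caller reports the allocation failure */
	demo_format(data, len + 1, names, count, current);

	return data;
}
```

With the names "timer", "heartbeat" and "cpu0" and "heartbeat" as the current
trigger, demo_format_alloc() produces "none timer [heartbeat] cpu0\n". Keying
the measuring mode off the remaining size, rather than off a separate flag,
keeps the two passes from disagreeing about the length.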

Re: [PATCH v2 1/1] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-27 Thread Akinobu Mita

On Sat, Sep 28, 2019 at 2:46, Greg Kroah-Hartman wrote:
>
> On Sat, Sep 28, 2019 at 01:47:21AM +0900, Akinobu Mita wrote:
> > On Fri, Sep 27, 2019 at 15:39, Greg Kroah-Hartman wrote:
> > >
> > > On Sat, Sep 14, 2019 at 12:03:24AM +0900, Akinobu Mita wrote:
> > > > Reading /sys/class/leds//trigger returns all available LED 
> > > > triggers.
> > > > However, the size of this file is limited to PAGE_SIZE because of the
> > > > limitation for sysfs attribute.
> > > >
> > > > Enabling LED CPU trigger on systems with thousands of CPUs easily hits
> > > > PAGE_SIZE limit, and makes it impossible to see all available LED 
> > > > triggers
> > > > and which trigger is currently activated.
> > > >
> > > > We work around it here by converting /sys/class/leds//trigger to
> > > > binary attribute, which is not limited by length. This is _not_ good
> > > > design, do not copy it.
> > > >
> > > > Cc: Greg Kroah-Hartman 
> > > > Cc: "Rafael J. Wysocki" 
> > > > Cc: Jacek Anaszewski 
> > > > Cc: Pavel Machek 
> > > > Cc: Dan Murphy 
> > > > Acked-by: Pavel Machek 
> > > > Signed-off-by: Akinobu Mita 
> > > > ---
> > > >  drivers/leds/led-class.c|  8 ++--
> > > >  drivers/leds/led-triggers.c | 90 
> > > > ++---
> > > >  drivers/leds/leds.h |  6 +++
> > > >  include/linux/leds.h|  5 ---
> > > >  4 files changed, 79 insertions(+), 30 deletions(-)
> > > >
> > > > diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
> > > > index 4793e77..8b5a1d1 100644
> > > > --- a/drivers/leds/led-class.c
> > > > +++ b/drivers/leds/led-class.c
> > > > @@ -73,13 +73,13 @@ static ssize_t max_brightness_show(struct device 
> > > > *dev,
> > > >  static DEVICE_ATTR_RO(max_brightness);
> > > >
> > > >  #ifdef CONFIG_LEDS_TRIGGERS
> > > > -static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store);
> > > > -static struct attribute *led_trigger_attrs[] = {
> > > > - _attr_trigger.attr,
> > > > +static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
> > > > +static struct bin_attribute *led_trigger_bin_attrs[] = {
> > > > + _attr_trigger,
> > > >   NULL,
> > > >  };
> > > >  static const struct attribute_group led_trigger_group = {
> > > > - .attrs = led_trigger_attrs,
> > > > + .bin_attrs = led_trigger_bin_attrs,
> > > >  };
> > > >  #endif
> > > >
> > > > diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
> > > > index 8d11a5e..ed5a311 100644
> > > > --- a/drivers/leds/led-triggers.c
> > > > +++ b/drivers/leds/led-triggers.c
> > > > @@ -16,6 +16,7 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > +#include 
> > > >  #include "leds.h"
> > > >
> > > >  /*
> > > > @@ -26,9 +27,11 @@ LIST_HEAD(trigger_list);
> > > >
> > > >   /* Used by LED Class */
> > > >
> > > > -ssize_t led_trigger_store(struct device *dev, struct device_attribute 
> > > > *attr,
> > > > - const char *buf, size_t count)
> > > > +ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
> > > > +   struct bin_attribute *bin_attr, char *buf,
> > > > +   loff_t pos, size_t count)
> > > >  {
> > > > + struct device *dev = kobj_to_dev(kobj);
> > > >   struct led_classdev *led_cdev = dev_get_drvdata(dev);
> > > >   struct led_trigger *trig;
> > > >   int ret = count;
> > > > @@ -64,39 +67,84 @@ ssize_t led_trigger_store(struct device *dev, 
> > > > struct device_attribute *attr,
> > > >   mutex_unlock(_cdev->led_access);
> > > >   return ret;
> > > >  }
> > > > -EXPORT_SYMBOL_GPL(led_trigger_store);
> > > > +EXPORT_SYMBOL_GPL(led_trigger_write);
> > > >
> > > > -ssize_t led_trigger_show(struct device *dev, struct device_attribute 
> > > > *attr,
> > > > - char *buf)
> > > > +__printf(4, 5)
> > > > +static int led_trigger_

Re: [PATCH v2 1/1] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-27 Thread Akinobu Mita

On Fri, Sep 27, 2019 at 15:39, Greg Kroah-Hartman wrote:
>
> On Sat, Sep 14, 2019 at 12:03:24AM +0900, Akinobu Mita wrote:
> > Reading /sys/class/leds//trigger returns all available LED triggers.
> > However, the size of this file is limited to PAGE_SIZE because of the
> > limitation for sysfs attribute.
> >
> > Enabling LED CPU trigger on systems with thousands of CPUs easily hits
> > PAGE_SIZE limit, and makes it impossible to see all available LED triggers
> > and which trigger is currently activated.
> >
> > We work around it here by converting /sys/class/leds//trigger to
> > binary attribute, which is not limited by length. This is _not_ good
> > design, do not copy it.
> >
> > Cc: Greg Kroah-Hartman 
> > Cc: "Rafael J. Wysocki" 
> > Cc: Jacek Anaszewski 
> > Cc: Pavel Machek 
> > Cc: Dan Murphy 
> > Acked-by: Pavel Machek 
> > Signed-off-by: Akinobu Mita 
> > ---
> >  drivers/leds/led-class.c|  8 ++--
> >  drivers/leds/led-triggers.c | 90 
> > ++---
> >  drivers/leds/leds.h |  6 +++
> >  include/linux/leds.h|  5 ---
> >  4 files changed, 79 insertions(+), 30 deletions(-)
> >
> > diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
> > index 4793e77..8b5a1d1 100644
> > --- a/drivers/leds/led-class.c
> > +++ b/drivers/leds/led-class.c
> > @@ -73,13 +73,13 @@ static ssize_t max_brightness_show(struct device *dev,
> >  static DEVICE_ATTR_RO(max_brightness);
> >
> >  #ifdef CONFIG_LEDS_TRIGGERS
> > -static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store);
> > -static struct attribute *led_trigger_attrs[] = {
> > - _attr_trigger.attr,
> > +static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
> > +static struct bin_attribute *led_trigger_bin_attrs[] = {
> > + _attr_trigger,
> >   NULL,
> >  };
> >  static const struct attribute_group led_trigger_group = {
> > - .attrs = led_trigger_attrs,
> > + .bin_attrs = led_trigger_bin_attrs,
> >  };
> >  #endif
> >
> > diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
> > index 8d11a5e..ed5a311 100644
> > --- a/drivers/leds/led-triggers.c
> > +++ b/drivers/leds/led-triggers.c
> > @@ -16,6 +16,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include "leds.h"
> >
> >  /*
> > @@ -26,9 +27,11 @@ LIST_HEAD(trigger_list);
> >
> >   /* Used by LED Class */
> >
> > -ssize_t led_trigger_store(struct device *dev, struct device_attribute 
> > *attr,
> > - const char *buf, size_t count)
> > +ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
> > +   struct bin_attribute *bin_attr, char *buf,
> > +   loff_t pos, size_t count)
> >  {
> > + struct device *dev = kobj_to_dev(kobj);
> >   struct led_classdev *led_cdev = dev_get_drvdata(dev);
> >   struct led_trigger *trig;
> >   int ret = count;
> > @@ -64,39 +67,84 @@ ssize_t led_trigger_store(struct device *dev, struct 
> > device_attribute *attr,
> >   mutex_unlock(_cdev->led_access);
> >   return ret;
> >  }
> > -EXPORT_SYMBOL_GPL(led_trigger_store);
> > +EXPORT_SYMBOL_GPL(led_trigger_write);
> >
> > -ssize_t led_trigger_show(struct device *dev, struct device_attribute *attr,
> > - char *buf)
> > +__printf(4, 5)
> > +static int led_trigger_snprintf(char *buf, size_t size, bool query,
> > + const char *fmt, ...)
> > +{
> > + va_list args;
> > + int i;
> > +
> > + va_start(args, fmt);
> > + if (query)
> > + i = vsnprintf(NULL, 0, fmt, args);
> > + else
> > + i = vscnprintf(buf, size, fmt, args);
> > + va_end(args);
> > +
> > + return i;
> > +}
>
> You only call this in one place, why is it needed like this?  The "old"
> code open-coded this, what is this helping with here?
>
> And what does "query" mean here?  I have no idea how that variable
> matters, or what it does.  Why not just test if buf is NULL or not if
> you don't want to use it?
>
> Ah, you are trying to see how "long" the buffer is going to be.  That
> makes more sense, but just trigger off of the NULL buffer or not, making
> this a bit more "obvious" what you are doing and not tieing two
> parameters to e

Re: [PATCH] rtc: r7301: Use devm_platform_ioremap_resource() in rtc7301_rtc_probe()

2019-09-21 Thread Akinobu Mita

On Sat, Sep 21, 2019 at 20:49, Markus Elfring wrote:
>
> From: Markus Elfring 
> Date: Sat, 21 Sep 2019 13:43:07 +0200
>
> Simplify this function implementation by using a known wrapper function.
>
> This issue was detected by using the Coccinelle software.
>
> Signed-off-by: Markus Elfring 

Reviewed-by: Akinobu Mita

[PATCH v2 1/1] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-13 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, the size of this file is limited to PAGE_SIZE because of the
limitation for sysfs attribute.

Enabling LED CPU trigger on systems with thousands of CPUs easily hits
PAGE_SIZE limit, and makes it impossible to see all available LED triggers
and which trigger is currently activated.

We work around it here by converting /sys/class/leds//trigger to
binary attribute, which is not limited by length. This is _not_ good
design, do not copy it.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Acked-by: Pavel Machek 
Signed-off-by: Akinobu Mita 
---
 drivers/leds/led-class.c|  8 ++--
 drivers/leds/led-triggers.c | 90 ++---
 drivers/leds/leds.h |  6 +++
 include/linux/leds.h|  5 ---
 4 files changed, 79 insertions(+), 30 deletions(-)

diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
index 4793e77..8b5a1d1 100644
--- a/drivers/leds/led-class.c
+++ b/drivers/leds/led-class.c
@@ -73,13 +73,13 @@ static ssize_t max_brightness_show(struct device *dev,
 static DEVICE_ATTR_RO(max_brightness);
 
 #ifdef CONFIG_LEDS_TRIGGERS
-static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store);
-static struct attribute *led_trigger_attrs[] = {
-   &dev_attr_trigger.attr,
+static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
+static struct bin_attribute *led_trigger_bin_attrs[] = {
+   &bin_attr_trigger,
NULL,
 };
 static const struct attribute_group led_trigger_group = {
-   .attrs = led_trigger_attrs,
+   .bin_attrs = led_trigger_bin_attrs,
 };
 #endif
 
diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 8d11a5e..ed5a311 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "leds.h"
 
 /*
@@ -26,9 +27,11 @@ LIST_HEAD(trigger_list);
 
  /* Used by LED Class */
 
-ssize_t led_trigger_store(struct device *dev, struct device_attribute *attr,
-   const char *buf, size_t count)
+ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t pos, size_t count)
 {
+   struct device *dev = kobj_to_dev(kobj);
struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
int ret = count;
@@ -64,39 +67,84 @@ ssize_t led_trigger_store(struct device *dev, struct 
device_attribute *attr,
mutex_unlock(&led_cdev->led_access);
return ret;
 }
-EXPORT_SYMBOL_GPL(led_trigger_store);
+EXPORT_SYMBOL_GPL(led_trigger_write);
 
-ssize_t led_trigger_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
+__printf(4, 5)
+static int led_trigger_snprintf(char *buf, size_t size, bool query,
+   const char *fmt, ...)
+{
+   va_list args;
+   int i;
+
+   va_start(args, fmt);
+   if (query)
+   i = vsnprintf(NULL, 0, fmt, args);
+   else
+   i = vscnprintf(buf, size, fmt, args);
+   va_end(args);
+
+   return i;
+}
+
+static int led_trigger_format(char *buf, size_t size, bool query,
+ struct led_classdev *led_cdev)
 {
-   struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
-   int len = 0;
+   int len = led_trigger_snprintf(buf, size, query, "%s",
+  led_cdev->trigger ? "none" : "[none]");
+
+   list_for_each_entry(trig, &trigger_list, next_trig) {
+   bool hit = led_cdev->trigger &&
+   !strcmp(led_cdev->trigger->name, trig->name);
+
+   len += led_trigger_snprintf(buf + len, size - len, query,
+   " %s%s%s", hit ? "[" : "",
+   trig->name, hit ? "]" : "");
+   }
+
+   len += led_trigger_snprintf(buf + len, size - len, query, "\n");
+
+   return len;
+}
+
+/*
+ * It was stupid to create 1 cpu triggers, but we are stuck with it now.
+ * Don't make that mistake again. We work around it here by creating binary
+ * attribute, which is not limited by length. This is _not_ good design, do not
+ * copy it.
+ */
+ssize_t led_trigger_read(struct file *filp, struct kobject *kobj,
+   struct bin_attribute *attr, char *buf,
+   loff_t pos, size_t count)
+{
+   struct device *dev = kobj_to_dev(kobj);
+   struct led_classdev *led_cdev = dev_get_drvdata(dev);
+   void *data;
+   int len;
 
down_read(&triggers_list_lock);
down_read(&led_cdev->trigger_lock);
 
-   if (!led_cdev->trigge
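
The led_trigger_snprintf() helper in the patch above takes a "query" flag so that the same formatting code can first measure the output and then fill a buffer of exactly that size. A minimal userspace sketch of that two-pass idea follows. It is not the kernel code: emit(), format_triggers(), render_triggers() and the names/active parameters are hypothetical, and plain vsnprintf() with manual clamping stands in for the kernel's vscnprintf().

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * In query mode nothing is stored and the return value is the length the
 * text would need. Otherwise the text is written at buf + off and the
 * return value is the number of characters actually stored.
 */
static int emit(char *buf, size_t size, size_t off, int query,
		const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	if (query) {
		n = vsnprintf(NULL, 0, fmt, args);
	} else if (off >= size) {
		n = 0;
	} else {
		n = vsnprintf(buf + off, size - off, fmt, args);
		if (n >= 0 && (size_t)n >= size - off)
			n = (int)(size - off - 1); /* truncated: report stored part */
	}
	va_end(args);
	return n < 0 ? 0 : n;
}

/* Same layout as the trigger file: "none" first, active entry in brackets. */
static size_t format_triggers(char *buf, size_t size, int query,
			      const char *const *names, size_t count,
			      int active)
{
	size_t len = 0;
	size_t i;

	len += (size_t)emit(buf, size, len, query, "%s",
			    active < 0 ? "[none]" : "none");
	for (i = 0; i < count; i++) {
		int hit = active >= 0 && (size_t)active == i;

		len += (size_t)emit(buf, size, len, query, " %s%s%s",
				    hit ? "[" : "", names[i], hit ? "]" : "");
	}
	len += (size_t)emit(buf, size, len, query, "\n");
	return len;
}

/* Pass 1 measures, pass 2 fills a buffer of exactly that size. */
static char *render_triggers(const char *const *names, size_t count,
			     int active)
{
	size_t len = format_triggers(NULL, 0, 1, names, count, active);
	char *buf = malloc(len + 1);

	if (!buf)
		return NULL;
	format_triggers(buf, len + 1, 0, names, count, active);
	return buf;
}
```

The caller owns the returned string and must free() it. Because the buffer is sized from the first pass, the result is never cut off at a fixed limit, which is the property the PAGE_SIZE-bound sysfs attribute could not provide.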

[PATCH v2 0/1] leds: fix /sys/class/leds//trigger

2019-09-13 Thread Akinobu Mita

(Resending with the version tag in the subject)

Reading /sys/class/leds//trigger returns all available LED triggers.
However, the size of this file is limited to PAGE_SIZE because of the
limitation for sysfs attribute.

Enabling LED CPU trigger on systems with thousands of CPUs easily hits
PAGE_SIZE limit, and makes it impossible to see all available LED triggers
and which trigger is currently activated.

This patch converts /sys/class/leds//trigger to bin attribute and
removes the PAGE_SIZE limitation.

The first version of this series provided the new api that follows the
"one value per file" rule of sysfs. This second version dropped it because
there have been a number of problems and it turns out that the new api
should be submitted separately.

* v2
- Update commit message
- Drop patches for new api

Akinobu Mita (1):
  leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

 drivers/leds/led-class.c|  8 ++--
 drivers/leds/led-triggers.c | 90 ++---
 drivers/leds/leds.h |  6 +++
 include/linux/leds.h|  5 ---
 4 files changed, 79 insertions(+), 30 deletions(-)

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
-- 
2.7.4

Re: [PATCH] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-12 Thread Akinobu Mita

On Fri, Sep 13, 2019 at 2:15, Jacek Anaszewski wrote:
>
> Hi Akinobu,
>
> Please bump patch version each time you send an update
> of the patch with the same subject.

Oops, should I resend with the correct subject?

[PATCH] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-12 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, the size of this file is limited to PAGE_SIZE because of the
limitation for sysfs attribute.

Enabling LED CPU trigger on systems with thousands of CPUs easily hits
PAGE_SIZE limit, and makes it impossible to see all available LED triggers
and which trigger is currently activated.

We work around it here by converting /sys/class/leds//trigger to
binary attribute, which is not limited by length. This is _not_ good
design, do not copy it.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Acked-by: Pavel Machek 
Signed-off-by: Akinobu Mita 
---
 drivers/leds/led-class.c|  8 ++--
 drivers/leds/led-triggers.c | 90 ++---
 drivers/leds/leds.h |  6 +++
 include/linux/leds.h|  5 ---
 4 files changed, 79 insertions(+), 30 deletions(-)

diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
index 4793e77..8b5a1d1 100644
--- a/drivers/leds/led-class.c
+++ b/drivers/leds/led-class.c
@@ -73,13 +73,13 @@ static ssize_t max_brightness_show(struct device *dev,
 static DEVICE_ATTR_RO(max_brightness);
 
 #ifdef CONFIG_LEDS_TRIGGERS
-static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store);
-static struct attribute *led_trigger_attrs[] = {
-   &dev_attr_trigger.attr,
+static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
+static struct bin_attribute *led_trigger_bin_attrs[] = {
+   &bin_attr_trigger,
NULL,
 };
 static const struct attribute_group led_trigger_group = {
-   .attrs = led_trigger_attrs,
+   .bin_attrs = led_trigger_bin_attrs,
 };
 #endif
 
diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 8d11a5e..ed5a311 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "leds.h"
 
 /*
@@ -26,9 +27,11 @@ LIST_HEAD(trigger_list);
 
  /* Used by LED Class */
 
-ssize_t led_trigger_store(struct device *dev, struct device_attribute *attr,
-   const char *buf, size_t count)
+ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t pos, size_t count)
 {
+   struct device *dev = kobj_to_dev(kobj);
struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
int ret = count;
@@ -64,39 +67,84 @@ ssize_t led_trigger_store(struct device *dev, struct 
device_attribute *attr,
mutex_unlock(&led_cdev->led_access);
return ret;
 }
-EXPORT_SYMBOL_GPL(led_trigger_store);
+EXPORT_SYMBOL_GPL(led_trigger_write);
 
-ssize_t led_trigger_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
+__printf(4, 5)
+static int led_trigger_snprintf(char *buf, size_t size, bool query,
+   const char *fmt, ...)
+{
+   va_list args;
+   int i;
+
+   va_start(args, fmt);
+   if (query)
+   i = vsnprintf(NULL, 0, fmt, args);
+   else
+   i = vscnprintf(buf, size, fmt, args);
+   va_end(args);
+
+   return i;
+}
+
+static int led_trigger_format(char *buf, size_t size, bool query,
+ struct led_classdev *led_cdev)
 {
-   struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
-   int len = 0;
+   int len = led_trigger_snprintf(buf, size, query, "%s",
+  led_cdev->trigger ? "none" : "[none]");
+
+   list_for_each_entry(trig, &trigger_list, next_trig) {
+   bool hit = led_cdev->trigger &&
+   !strcmp(led_cdev->trigger->name, trig->name);
+
+   len += led_trigger_snprintf(buf + len, size - len, query,
+   " %s%s%s", hit ? "[" : "",
+   trig->name, hit ? "]" : "");
+   }
+
+   len += led_trigger_snprintf(buf + len, size - len, query, "\n");
+
+   return len;
+}
+
+/*
+ * It was stupid to create 1 cpu triggers, but we are stuck with it now.
+ * Don't make that mistake again. We work around it here by creating binary
+ * attribute, which is not limited by length. This is _not_ good design, do not
+ * copy it.
+ */
+ssize_t led_trigger_read(struct file *filp, struct kobject *kobj,
+   struct bin_attribute *attr, char *buf,
+   loff_t pos, size_t count)
+{
+   struct device *dev = kobj_to_dev(kobj);
+   struct led_classdev *led_cdev = dev_get_drvdata(dev);
+   void *data;
+   int len;
 
down_read(&triggers_list_lock);
down_read(&led_cdev->trigger_lock);
 
-   if (!led_cdev->trigge

[PATCH] leds: fix /sys/class/leds//trigger

2019-09-12 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, the size of this file is limited to PAGE_SIZE because of the
limitation for sysfs attribute.

Enabling LED CPU trigger on systems with thousands of CPUs easily hits
PAGE_SIZE limit, and makes it impossible to see all available LED triggers
and which trigger is currently activated.

This patch converts /sys/class/leds//trigger to bin attribute and
removes the PAGE_SIZE limitation.

The first version of this series provided the new api that follows the
"one value per file" rule of sysfs. This second version dropped it because
there have been a number of problems and it turns out that the new api
should be submitted separately.

* v2
- Update commit message
- Drop patches for new api

Akinobu Mita (1):
  leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

 drivers/leds/led-class.c|  8 ++--
 drivers/leds/led-triggers.c | 90 ++---
 drivers/leds/leds.h |  6 +++
 include/linux/leds.h|  5 ---
 4 files changed, 79 insertions(+), 30 deletions(-)

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
-- 
2.7.4

Re: [PATCH 1/5] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-11 Thread Akinobu Mita

On Thu, Sep 12, 2019 at 0:36, Greg Kroah-Hartman wrote:
>
> On Thu, Sep 12, 2019 at 12:25:28AM +0900, Akinobu Mita wrote:
> > On Sun, Sep 8, 2019 at 22:10, Greg Kroah-Hartman wrote:
> > >
> > > On Sun, Sep 08, 2019 at 09:41:08PM +0900, Akinobu Mita wrote:
> > > > Reading /sys/class/leds//trigger returns all available LED 
> > > > triggers.
> > > > However, the size of this file is limited to PAGE_SIZE because of the
> > > > limitation for sysfs attribute.
> > > >
> > > > Enabling LED CPU trigger on systems with thousands of CPUs easily hits
> > > > PAGE_SIZE limit, and makes it impossible to see all available LED 
> > > > triggers
> > > > and which trigger is currently activated.
> > > >
> > > > This converts /sys/class/leds//trigger to bin attribute and removes
> > > > the PAGE_SIZE limitation.
> > > >
> > > > Cc: Greg Kroah-Hartman 
> > > > Cc: "Rafael J. Wysocki" 
> > > > Cc: Jacek Anaszewski 
> > > > Cc: Pavel Machek 
> > > > Cc: Dan Murphy 
> > > > Acked-by: Pavel Machek 
> > > > Signed-off-by: Akinobu Mita 
> > > > ---
> > > >  drivers/leds/led-class.c|  8 ++--
> > > >  drivers/leds/led-triggers.c | 90 
> > > > ++---
> > > >  drivers/leds/leds.h |  6 +++
> > > >  include/linux/leds.h|  5 ---
> > > >  4 files changed, 79 insertions(+), 30 deletions(-)
> > > >
> > > > diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
> > > > index 4793e77..8b5a1d1 100644
> > > > --- a/drivers/leds/led-class.c
> > > > +++ b/drivers/leds/led-class.c
> > > > @@ -73,13 +73,13 @@ static ssize_t max_brightness_show(struct device 
> > > > *dev,
> > > >  static DEVICE_ATTR_RO(max_brightness);
> > > >
> > > >  #ifdef CONFIG_LEDS_TRIGGERS
> > > > -static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store);
> > > > -static struct attribute *led_trigger_attrs[] = {
> > > > - &dev_attr_trigger.attr,
> > > > +static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
> > >
> > > BIN_ATTR_RW()?
> >
> > We can use BIN_ATTR_RW() by renaming led_trigger_{read,write}() to
> > trigger_{read,write}().  But led_trigger_{read,write}() are not static
> > functions.  These are defined as export symbols for led-class module.
> >
> > So trigger_{read,write}() will be too generic symbol names, won't they?
>
> Yes they would, sorry I didn't notice that.
>
> Wait, why are those functions being exported?  Who is calling a sysfs
> function from a different code path than sysfs?

led-class.c :)

led_trigger_{read,write}() are defined in led-triggers.c which is built
into the kernel. led-class.c can be built as module.

Re: [PATCH 1/5] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-11 Thread Akinobu Mita

On Sun, Sep 8, 2019 at 22:10, Greg Kroah-Hartman wrote:
>
> On Sun, Sep 08, 2019 at 09:41:08PM +0900, Akinobu Mita wrote:
> > Reading /sys/class/leds//trigger returns all available LED triggers.
> > However, the size of this file is limited to PAGE_SIZE because of the
> > limitation for sysfs attribute.
> >
> > Enabling LED CPU trigger on systems with thousands of CPUs easily hits
> > PAGE_SIZE limit, and makes it impossible to see all available LED triggers
> > and which trigger is currently activated.
> >
> > This converts /sys/class/leds//trigger to bin attribute and removes
> > the PAGE_SIZE limitation.
> >
> > Cc: Greg Kroah-Hartman 
> > Cc: "Rafael J. Wysocki" 
> > Cc: Jacek Anaszewski 
> > Cc: Pavel Machek 
> > Cc: Dan Murphy 
> > Acked-by: Pavel Machek 
> > Signed-off-by: Akinobu Mita 
> > ---
> >  drivers/leds/led-class.c|  8 ++--
> >  drivers/leds/led-triggers.c | 90 
> > ++---
> >  drivers/leds/leds.h |  6 +++
> >  include/linux/leds.h|  5 ---
> >  4 files changed, 79 insertions(+), 30 deletions(-)
> >
> > diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
> > index 4793e77..8b5a1d1 100644
> > --- a/drivers/leds/led-class.c
> > +++ b/drivers/leds/led-class.c
> > @@ -73,13 +73,13 @@ static ssize_t max_brightness_show(struct device *dev,
> >  static DEVICE_ATTR_RO(max_brightness);
> >
> >  #ifdef CONFIG_LEDS_TRIGGERS
> > -static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store);
> > -static struct attribute *led_trigger_attrs[] = {
> > - &dev_attr_trigger.attr,
> > +static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
>
> BIN_ATTR_RW()?

We can use BIN_ATTR_RW() by renaming led_trigger_{read,write}() to
trigger_{read,write}().  But led_trigger_{read,write}() are not static
functions.  These are defined as export symbols for led-class module.

So trigger_{read,write}() will be too generic symbol names, won't they?
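
The naming constraint discussed above comes from token pasting: the shorthand macro derives the callback names from the attribute name, while the explicit form lets the caller pick them. The sketch below mimics that in plain userspace C. struct demo_bin_attr and the DEMO_* macros are hypothetical stand-ins, not the kernel's definitions, and the callbacks are stubs.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct bin_attribute. */
struct demo_bin_attr {
	const char *name;
	unsigned int mode;
	long (*read)(char *buf, size_t count);
	long (*write)(const char *buf, size_t count);
};

/* Explicit form: the caller names the callbacks freely. */
#define DEMO_ATTR_INIT(aname, amode, rd, wr) \
	{ .name = #aname, .mode = (amode), .read = (rd), .write = (wr) }

/* Shorthand form: the callbacks must be called <aname>_read/<aname>_write. */
#define DEMO_ATTR_RW_INIT(aname) \
	DEMO_ATTR_INIT(aname, 0644, aname##_read, aname##_write)

/* Prefixed names, suitable for symbols shared between modules. */
static long led_trigger_read(char *buf, size_t count)
{
	(void)buf;
	return (long)count;
}

static long led_trigger_write(const char *buf, size_t count)
{
	(void)buf;
	return (long)count;
}

/* Generic names that the shorthand form would force on a "trigger" file. */
static long trigger_read(char *buf, size_t count)
{
	(void)buf;
	return (long)count;
}

static long trigger_write(const char *buf, size_t count)
{
	(void)buf;
	return (long)count;
}

/* Both describe a file called "trigger"; only the callback names differ. */
static const struct demo_bin_attr explicit_attr =
	DEMO_ATTR_INIT(trigger, 0644, led_trigger_read, led_trigger_write);
static const struct demo_bin_attr shorthand_attr = DEMO_ATTR_RW_INIT(trigger);
```

With the shorthand form, the only way to keep the file name "trigger" is to define functions literally named trigger_read() and trigger_write(), which is why the explicit form is the better fit when the callbacks are exported symbols.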

[PATCH 5/5] leds: add /sys/class/leds//current-trigger

2019-09-08 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, this violates the "one value per file" rule of sysfs.

This provides /sys/class/leds//current-trigger which is almost
identical to /sys/class/leds//trigger.  The only difference is that
'current-trigger' only shows the current trigger name.

This new file follows the "one value per file" rule of sysfs.
We can use the /sys/class/leds/triggers directory to get the list of
available LED triggers.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Signed-off-by: Akinobu Mita 
---
 Documentation/ABI/testing/sysfs-class-led | 13 +++
 drivers/leds/led-class.c  |  7 ++
 drivers/leds/led-triggers.c   | 38 +++
 drivers/leds/leds.h   |  5 
 4 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-led 
b/Documentation/ABI/testing/sysfs-class-led
index 14d91af..1a1be10 100644
--- a/Documentation/ABI/testing/sysfs-class-led
+++ b/Documentation/ABI/testing/sysfs-class-led
@@ -70,3 +70,16 @@ Description:
This directory contains a number of sub-directories, each
representing an LED trigger. The name of the sub-directory
matches the LED trigger name.
+
+What:  /sys/class/leds//current-trigger
+Date:  September 2019
+KernelVersion: 5.5
+Contact:   linux-l...@vger.kernel.org
+Description:
+   Set the trigger for this LED. A trigger is a kernel based source
+   of LED events.
+   Writing the trigger name to this file will change the current
+   trigger. Trigger specific parameters can appear in
+   /sys/class/leds/ once a given trigger is selected. For
+   their documentation see sysfs-class-led-trigger-*.
+   Reading this file will return the current LED trigger name.
diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
index 04e6c14..388500b 100644
--- a/drivers/leds/led-class.c
+++ b/drivers/leds/led-class.c
@@ -73,12 +73,19 @@ static ssize_t max_brightness_show(struct device *dev,
 static DEVICE_ATTR_RO(max_brightness);
 
 #ifdef CONFIG_LEDS_TRIGGERS
+static DEVICE_ATTR(current_trigger, 0644, led_current_trigger_show,
+  led_current_trigger_store);
+static struct attribute *led_current_trigger_attrs[] = {
+   &dev_attr_current_trigger.attr,
+   NULL,
+};
 static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
 static struct bin_attribute *led_trigger_bin_attrs[] = {
&bin_attr_trigger,
NULL,
 };
 static const struct attribute_group led_trigger_group = {
+   .attrs = led_current_trigger_attrs,
.bin_attrs = led_trigger_bin_attrs,
 };
 
diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 4a86964..41bcc508 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -27,11 +27,9 @@ LIST_HEAD(trigger_list);
 
  /* Used by LED Class */
 
-ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
- struct bin_attribute *bin_attr, char *buf,
- loff_t pos, size_t count)
+static ssize_t led_trigger_store(struct device *dev, const char *buf,
+size_t count)
 {
-   struct device *dev = kobj_to_dev(kobj);
struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
int ret = count;
@@ -67,8 +65,25 @@ ssize_t led_trigger_write(struct file *filp, struct kobject 
*kobj,
mutex_unlock(&led_cdev->led_access);
return ret;
 }
+
+ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t pos, size_t count)
+{
+   struct device *dev = kobj_to_dev(kobj);
+
+   return led_trigger_store(dev, buf, count);
+}
 EXPORT_SYMBOL_GPL(led_trigger_write);
 
+ssize_t led_current_trigger_store(struct device *dev,
+   struct device_attribute *attr, const char *buf,
+   size_t count)
+{
+   return led_trigger_store(dev, buf, count);
+}
+EXPORT_SYMBOL_GPL(led_current_trigger_store);
+
 __printf(4, 5)
 static int led_trigger_snprintf(char *buf, size_t size, bool query,
const char *fmt, ...)
@@ -146,6 +161,21 @@ ssize_t led_trigger_read(struct file *filp, struct kobject 
*kobj,
 }
 EXPORT_SYMBOL_GPL(led_trigger_read);
 
+ssize_t led_current_trigger_show(struct device *dev,
+struct device_attribute *attr, char *buf)
+{
+   struct led_classdev *led_cdev = dev_get_drvdata(dev);
+   int len;
+
+   down_read(&led_cdev->trigger_lock);
+   len = scnprintf(buf, PAGE_SIZE, "%s\n", led_cdev->trigger ?
+

[PATCH 4/5] leds: add /sys/class/triggers/ that contains trigger sub-directories

2019-09-08 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, this violates the "one value per file" rule of sysfs.

This provides /sys/class/leds/triggers directory that contains a number of
sub-directories, each representing an LED trigger. The name of the
sub-directory matches the LED trigger name.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Signed-off-by: Akinobu Mita 
---
 Documentation/ABI/testing/sysfs-class-led |  9 +
 drivers/leds/led-class.c  | 32 +++
 drivers/leds/led-triggers.c   | 19 ++
 drivers/leds/leds.h   |  1 +
 include/linux/leds.h  |  1 +
 5 files changed, 62 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-led 
b/Documentation/ABI/testing/sysfs-class-led
index 5f67f7a..14d91af 100644
--- a/Documentation/ABI/testing/sysfs-class-led
+++ b/Documentation/ABI/testing/sysfs-class-led
@@ -61,3 +61,12 @@ Description:
gpio and backlight triggers. In case of the backlight trigger,
it is useful when driving a LED which is intended to indicate
a device in a standby like state.
+
+What:  /sys/class/leds/triggers/
+Date:  September 2019
+KernelVersion: 5.5
+Contact:   linux-l...@vger.kernel.org
+Description:
+   This directory contains a number of sub-directories, each
+   representing an LED trigger. The name of the sub-directory
+   matches the LED trigger name.
diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
index 7d85181..04e6c14 100644
--- a/drivers/leds/led-class.c
+++ b/drivers/leds/led-class.c
@@ -81,6 +81,28 @@ static struct bin_attribute *led_trigger_bin_attrs[] = {
 static const struct attribute_group led_trigger_group = {
.bin_attrs = led_trigger_bin_attrs,
 };
+
+static int led_triggers_kobj_create(void)
+{
+   led_triggers_kobj = class_kobject_create_and_add("triggers",
+leds_class);
+
+   return led_triggers_kobj ? 0 : -ENOMEM;
+}
+
+static void led_triggers_kobj_destroy(void)
+{
+   kobject_put(led_triggers_kobj);
+}
+
+#else
+static inline int led_triggers_kobj_create(void)
+{
+   return 0;
+}
+static void led_triggers_kobj_destroy(void)
+{
+}
 #endif
 
 static struct attribute *led_class_attrs[] = {
@@ -411,16 +433,26 @@ EXPORT_SYMBOL_GPL(devm_led_classdev_unregister);
 
 static int __init leds_init(void)
 {
+   int ret;
+
leds_class = class_create(THIS_MODULE, "leds");
if (IS_ERR(leds_class))
return PTR_ERR(leds_class);
leds_class->pm = &leds_class_dev_pm_ops;
leds_class->dev_groups = led_groups;
+
+   ret = led_triggers_kobj_create();
+   if (ret) {
+   class_unregister(leds_class);
+   return ret;
+   }
+
return 0;
 }
 
 static void __exit leds_exit(void)
 {
+   led_triggers_kobj_destroy();
class_destroy(leds_class);
 }
 
diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index ed5a311..4a86964 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -268,16 +268,26 @@ void led_trigger_rename_static(const char *name, struct 
led_trigger *trig)
 }
 EXPORT_SYMBOL_GPL(led_trigger_rename_static);
 
+static struct kobj_type led_trigger_kobj_type = {
+   .sysfs_ops = &kobj_sysfs_ops,
+};
+
+struct kobject *led_triggers_kobj;
+EXPORT_SYMBOL_GPL(led_triggers_kobj);
+
 /* LED Trigger Interface */
 
 int led_trigger_register(struct led_trigger *trig)
 {
struct led_classdev *led_cdev;
struct led_trigger *_trig;
+   int ret;
 
rwlock_init(&trig->leddev_list_lock);
INIT_LIST_HEAD(&trig->led_cdevs);
 
+   kobject_init(&trig->kobj, &led_trigger_kobj_type);
+
down_write(&triggers_list_lock);
/* Make sure the trigger's name isn't already in use */
list_for_each_entry(_trig, &trigger_list, next_trig) {
@@ -286,6 +296,14 @@ int led_trigger_register(struct led_trigger *trig)
return -EEXIST;
}
}
+
+   WARN_ON_ONCE(!led_triggers_kobj);
+   ret = kobject_add(&trig->kobj, led_triggers_kobj, "%s", trig->name);
+   if (ret) {
+   up_write(&triggers_list_lock);
+   return ret;
+   }
+
/* Add to the list of led triggers */
list_add_tail(&trig->next_trig, &trigger_list);
up_write(&triggers_list_lock);
@@ -316,6 +334,7 @@ void led_trigger_unregister(struct led_trigger *trig)
 
/* Remove from the list of led triggers */
down_write(&triggers_list_lock);
+   kobject_put(&trig->kobj);
list_del_init(&trig->next_trig);
up_write(&triggers_list_lock);
 
diff --git a/drivers/leds/leds.h b/drivers/leds/leds.h
index a0ee33c..52debe0 100644
--- a/drivers/leds/leds.h
+++ b/drivers/leds/leds

[PATCH 2/5] leds: make sure leds_class is initialized before triggers are registered

2019-09-08 Thread Akinobu Mita

If the led-class and usb-common modules are built into the kernel, the
usb-common module could be initialized earlier than the led-class module.

So when the ledtrig_usb_gadget and ledtrig_usb_host LED triggers are
registered by the usb-common module, leds_class may not have been
initialized yet.

We are going to populate sub-directories, each representing an LED
trigger, in /sys/class/leds/triggers/, so leds_class needs to be
initialized before any LED trigger is registered.

This makes led-class initialize earlier than usb-common by changing its
initcall level from subsys_initcall to postcore_initcall.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Signed-off-by: Akinobu Mita 
---
 drivers/leds/led-class.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
index 8b5a1d1..7d85181 100644
--- a/drivers/leds/led-class.c
+++ b/drivers/leds/led-class.c
@@ -424,7 +424,7 @@ static void __exit leds_exit(void)
class_destroy(leds_class);
 }
 
-subsys_initcall(leds_init);
+postcore_initcall(leds_init);
 module_exit(leds_exit);
 
 MODULE_AUTHOR("John Lenz, Richard Purdie");
-- 
2.7.4
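
The effect of the one-line change above can be pictured with a small userspace model of initcall ordering. Built-in initcalls run level by level, and within one level in link order. In this sketch the enum values follow the kernel's level numbering (postcore is 2, subsys is 4), while struct demo_initcall, boot_order() and the string labels are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Two of the kernel's initcall levels; lower levels run earlier at boot. */
enum demo_level { DEMO_POSTCORE = 2, DEMO_SUBSYS = 4 };

struct demo_initcall {
	enum demo_level level;
	const char *name;
};

/*
 * Fill order[] with the names in boot order: lower levels first, and within
 * one level the order of the input array (standing in for link order).
 * Returns the number of entries written.
 */
static size_t boot_order(const struct demo_initcall *calls, size_t n,
			 const char **order)
{
	size_t out = 0;
	size_t i;
	int level;

	for (level = 0; level <= 7; level++)
		for (i = 0; i < n; i++)
			if ((int)calls[i].level == level)
				order[out++] = calls[i].name;
	return out;
}
```

With both entries at DEMO_SUBSYS the result depends only on array (link) order, so "usb-common" can come first. Moving "led-class" to DEMO_POSTCORE puts it first regardless of where it sits in the array.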

[PATCH 0/5] leds: fix /sys/class/leds//trigger and add new api

2019-09-08 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, the size of this file is limited to PAGE_SIZE because of the
limitation for sysfs attribute.

Enabling LED CPU trigger on systems with thousands of CPUs easily hits
PAGE_SIZE limit, and makes it impossible to see all available LED triggers
and which trigger is currently activated.

The first patch in this series converts /sys/class/leds//trigger to
bin attribute and removes the PAGE_SIZE limitation.

The rest of the series provides a new /sys/class/leds/triggers/ directory
and /sys/class/leds//current-trigger. The new api follows the "one value
per file" rule of sysfs.

Akinobu Mita (5):
  leds: remove PAGE_SIZE limit of /sys/class/leds//trigger
  leds: make sure leds_class is initialized before triggers are
registered
  driver core: class: add function to create /sys/class//foo
directory
  leds: add /sys/class/triggers/ that contains trigger sub-directories
  leds: add /sys/class/leds//current-trigger

 Documentation/ABI/testing/sysfs-class-led |  22 +
 drivers/base/class.c  |   7 ++
 drivers/leds/led-class.c  |  49 +--
 drivers/leds/led-triggers.c   | 139 +-
 drivers/leds/leds.h   |  12 +++
 include/linux/device.h|   3 +
 include/linux/leds.h  |   6 +-
 7 files changed, 207 insertions(+), 31 deletions(-)

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
-- 
2.7.4
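
To see why a single-value file is easier for userspace, consider what a reader of the existing trigger file has to do: scan a space-separated list for the one entry in brackets. The sketch below is a hypothetical userspace helper (active_trigger() is not part of any kernel or library API) that parses the format produced by the trigger file, for example "none timer [heartbeat]".

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) entry of a trigger list into out.
 * Returns 0 on success, or -1 if the list has no complete "[entry]"
 * or the entry does not fit in out.
 */
static int active_trigger(const char *list, char *out, size_t outsz)
{
	const char *lb = strchr(list, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t len;

	if (!rb)
		return -1;
	len = (size_t)(rb - lb - 1);
	if (len + 1 > outsz)
		return -1;
	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}
```

If the list is cut off before the closing bracket, as can happen when the output is limited to PAGE_SIZE, the helper has nothing reliable to return. A file that holds only the current trigger name needs no such parsing.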

[PATCH 3/5] driver core: class: add function to create /sys/class//foo directory

2019-09-08 Thread Akinobu Mita

This adds a new function class_kobject_create_and_add() that creates a
directory in the /sys/class/.

This function is required to create the /sys/class/leds/triggers directory
that contains all available LED triggers.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Signed-off-by: Akinobu Mita 
---
 drivers/base/class.c   | 7 +++
 include/linux/device.h | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/drivers/base/class.c b/drivers/base/class.c
index d8a6a58..f4c53e7 100644
--- a/drivers/base/class.c
+++ b/drivers/base/class.c
@@ -104,6 +104,13 @@ void class_remove_file_ns(struct class *cls, const struct 
class_attribute *attr,
sysfs_remove_file_ns(&cls->p->subsys.kobj, &attr->attr, ns);
 }
 
+struct kobject *class_kobject_create_and_add(const char *name,
+struct class *cls)
+{
+   return kobject_create_and_add(name, &cls->p->subsys.kobj);
+}
+EXPORT_SYMBOL_GPL(class_kobject_create_and_add);
+
 static struct class *class_get(struct class *cls)
 {
if (cls)
diff --git a/include/linux/device.h b/include/linux/device.h
index 6717ade..335e901 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -505,6 +505,9 @@ static inline void class_remove_file(struct class *class,
return class_remove_file_ns(class, attr, NULL);
 }
 
+struct kobject * __must_check class_kobject_create_and_add(const char *name,
+  struct class *cls);
+
 /* Simple class attribute that is just a static string */
 struct class_attribute_string {
struct class_attribute attr;
-- 
2.7.4

[PATCH 1/5] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-08 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, the size of this file is limited to PAGE_SIZE because of the
limitation for sysfs attribute.

Enabling LED CPU trigger on systems with thousands of CPUs easily hits
PAGE_SIZE limit, and makes it impossible to see all available LED triggers
and which trigger is currently activated.

This converts /sys/class/leds//trigger to bin attribute and removes
the PAGE_SIZE limitation.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Acked-by: Pavel Machek 
Signed-off-by: Akinobu Mita 
---
 drivers/leds/led-class.c|  8 ++--
 drivers/leds/led-triggers.c | 90 ++---
 drivers/leds/leds.h |  6 +++
 include/linux/leds.h|  5 ---
 4 files changed, 79 insertions(+), 30 deletions(-)

diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
index 4793e77..8b5a1d1 100644
--- a/drivers/leds/led-class.c
+++ b/drivers/leds/led-class.c
@@ -73,13 +73,13 @@ static ssize_t max_brightness_show(struct device *dev,
 static DEVICE_ATTR_RO(max_brightness);
 
 #ifdef CONFIG_LEDS_TRIGGERS
-static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store);
-static struct attribute *led_trigger_attrs[] = {
-   &dev_attr_trigger.attr,
+static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
+static struct bin_attribute *led_trigger_bin_attrs[] = {
+   &bin_attr_trigger,
NULL,
 };
 static const struct attribute_group led_trigger_group = {
-   .attrs = led_trigger_attrs,
+   .bin_attrs = led_trigger_bin_attrs,
 };
 #endif
 
diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 8d11a5e..ed5a311 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "leds.h"
 
 /*
@@ -26,9 +27,11 @@ LIST_HEAD(trigger_list);
 
  /* Used by LED Class */
 
-ssize_t led_trigger_store(struct device *dev, struct device_attribute *attr,
-   const char *buf, size_t count)
+ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t pos, size_t count)
 {
+   struct device *dev = kobj_to_dev(kobj);
struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
int ret = count;
@@ -64,39 +67,84 @@ ssize_t led_trigger_store(struct device *dev, struct 
device_attribute *attr,
mutex_unlock(&led_cdev->led_access);
return ret;
 }
-EXPORT_SYMBOL_GPL(led_trigger_store);
+EXPORT_SYMBOL_GPL(led_trigger_write);
 
-ssize_t led_trigger_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
+__printf(4, 5)
+static int led_trigger_snprintf(char *buf, size_t size, bool query,
+   const char *fmt, ...)
+{
+   va_list args;
+   int i;
+
+   va_start(args, fmt);
+   if (query)
+   i = vsnprintf(NULL, 0, fmt, args);
+   else
+   i = vscnprintf(buf, size, fmt, args);
+   va_end(args);
+
+   return i;
+}
+
+static int led_trigger_format(char *buf, size_t size, bool query,
+ struct led_classdev *led_cdev)
 {
-   struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
-   int len = 0;
+   int len = led_trigger_snprintf(buf, size, query, "%s",
+  led_cdev->trigger ? "none" : "[none]");
+
+   list_for_each_entry(trig, &trigger_list, next_trig) {
+   bool hit = led_cdev->trigger &&
+   !strcmp(led_cdev->trigger->name, trig->name);
+
+   len += led_trigger_snprintf(buf + len, size - len, query,
+   " %s%s%s", hit ? "[" : "",
+   trig->name, hit ? "]" : "");
+   }
+
+   len += led_trigger_snprintf(buf + len, size - len, query, "\n");
+
+   return len;
+}
+
+/*
+ * It was stupid to create 1 cpu triggers, but we are stuck with it now.
+ * Don't make that mistake again. We work around it here by creating binary
+ * attribute, which is not limited by length. This is _not_ good design, do not
+ * copy it.
+ */
+ssize_t led_trigger_read(struct file *filp, struct kobject *kobj,
+   struct bin_attribute *attr, char *buf,
+   loff_t pos, size_t count)
+{
+   struct device *dev = kobj_to_dev(kobj);
+   struct led_classdev *led_cdev = dev_get_drvdata(dev);
+   void *data;
+   int len;
 
down_read(&triggers_list_lock);
down_read(&led_cdev->trigger_lock);
 
-   if (!led_cdev->trigger)
-   len += scnprintf(buf+len, PAGE_

Re: [PATCH] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-03 Thread Akinobu Mita

On Tue, Sep 3, 2019 at 23:07, Greg KH wrote:
>
> On Tue, Sep 03, 2019 at 10:55:40PM +0900, Akinobu Mita wrote:
> > On Tue, Sep 3, 2019 at 4:08, Greg KH wrote:
> > >
> > > On Mon, Sep 02, 2019 at 08:47:02PM +0200, Jacek Anaszewski wrote:
> > > > On 9/2/19 8:12 PM, Greg KH wrote:
> > > > > On Sun, Sep 01, 2019 at 06:53:34PM +0200, Jacek Anaszewski wrote:
> > > > >> Hi Akinobu,
> > > > >>
> > > > >> Thank you for the patch.
> > > > >>
> > > > >> I have one nit below but in general it looks good to me.
> > > > >> I've tested it with 2000 mtd triggers (~14kB file size)
> > > > >> and it worked flawlessly.
> > > > >>
> > > > >> Still, I would like to have ack from Greg for it.
> > > > >>
> > > > >> Adding Greg on Cc.
> > > > >>
> > > > >> On 8/29/19 4:49 PM, Akinobu Mita wrote:
> > > > >>> Reading /sys/class/leds//trigger returns all available LED 
> > > > >>> triggers.
> > > > >>> However, the size of this file is limited to PAGE_SIZE because of 
> > > > >>> the
> > > > >>> limitation for sysfs attribute.
> > > > >>>
> > > > >>> Enabling LED CPU trigger on systems with thousands of CPUs easily 
> > > > >>> hits
> > > > >>> PAGE_SIZE limit, and makes it impossible to see all available LED 
> > > > >>> triggers
> > > > >>> and which trigger is currently activated.
> > > > >>>
> > > > >>> This converts /sys/class/leds//trigger to bin attribute and 
> > > > >>> removes
> > > > >>> the PAGE_SIZE limitation.
> > > > >
> > > > > But this is NOT a binary file.  A sysfs binary file is used for when 
> > > > > the
> > > > > kernel passes data to or from hardware without any parsing of the data
> > > > > by the kernel.
> > > > >
> > > > > You are not doing that here, you are abusing the "one value per file"
> > > > > rule of sysfs so much that you are forced to work around the 
> > > > > limitation
> > > > > it put in place on purpose to keep you from doing stuff like this.
> > > > >
> > > > > Please fix this "correctly" by creating a new api that works properly
> > > > > and just live with the fact that this file will never work correctly 
> > > > > and
> > > > > move everyone to use the new api instead.
> > > > >
> > > > > Don't keep on abusing the interface by workarounds like this, it is 
> > > > > not
> > > > > ok.
> > > >
> > > > In the message [0] you pledged to give us exception for that, provided
> > > > it will be properly documented in the code. I suppose you now object
> > > > because the patch does not meet that condition.
> > >
> > > Well, I honestly don't remember writing that email, but it was 5 months
> > > and many thousands of emails ago :)
> > >
> > > Also, you all didn't document the heck out of this.  So no, I really do
> > > not want to see this patch accepted as-is.
> > >
> > > > Provided that will be fixed, can we count on your ack for the
> > > > implementation of the solution you proposed? :-)
> > >
> > > Let's see the patch that actually implements what I suggested first :)
> >
> > I'd propose introducing a new procfs file (/proc/led-triggers) and new
> > /sys/class/leds//current-trigger api.
> >
> > Reading /proc/led-triggers file shows all available triggers.
> > This violates "one value per file", but it's a procfs file.
>
> No, procfs files are ONLY for process-related things.  Don't keep the
> insanity of this file format by just moving it out of sysfs and into
> procfs :)

I see.

How about creating one file or directory for each led-trigger in
/sys/kernel/led-triggers directory?

e.g.

$ ls /sys/kernel/led-triggers
audio-micmute          ide-disk        phy0assoc
audio-mute             kbd-altgrlock   phy0radio
...
hidpp_battery_3-full   panic

Re: [PATCH] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-03 Thread Akinobu Mita

2019年9月3日(火) 4:08 Greg KH :
>
> On Mon, Sep 02, 2019 at 08:47:02PM +0200, Jacek Anaszewski wrote:
> > On 9/2/19 8:12 PM, Greg KH wrote:
> > > On Sun, Sep 01, 2019 at 06:53:34PM +0200, Jacek Anaszewski wrote:
> > >> Hi Akinobu,
> > >>
> > >> Thank you for the patch.
> > >>
> > >> I have one nit below but in general it looks good to me.
> > >> I've tested it with 2000 mtd triggers (~14kB file size)
> > >> and it worked flawlessly.
> > >>
> > >> Still, I would like to have ack from Greg for it.
> > >>
> > >> Adding Greg on Cc.
> > >>
> > >> On 8/29/19 4:49 PM, Akinobu Mita wrote:
> > >>> Reading /sys/class/leds//trigger returns all available LED 
> > >>> triggers.
> > >>> However, the size of this file is limited to PAGE_SIZE because of the
> > >>> limitation for sysfs attribute.
> > >>>
> > >>> Enabling LED CPU trigger on systems with thousands of CPUs easily hits
> > >>> PAGE_SIZE limit, and makes it impossible to see all available LED 
> > >>> triggers
> > >>> and which trigger is currently activated.
> > >>>
> > >>> This converts /sys/class/leds//trigger to bin attribute and removes
> > >>> the PAGE_SIZE limitation.
> > >
> > > But this is NOT a binary file.  A sysfs binary file is used for when the
> > > kernel passes data to or from hardware without any parsing of the data
> > > by the kernel.
> > >
> > > You are not doing that here, you are abusing the "one value per file"
> > > rule of sysfs so much that you are forced to work around the limitation
> > > it put in place on purpose to keep you from doing stuff like this.
> > >
> > > Please fix this "correctly" by creating a new api that works properly
> > > and just live with the fact that this file will never work correctly and
> > > move everyone to use the new api instead.
> > >
> > > Don't keep on abusing the interface by workarounds like this, it is not
> > > ok.
> >
> > In the message [0] you pledged to give us exception for that, provided
> > it will be properly documented in the code. I suppose you now object
> > because the patch does not meet that condition.
>
> Well, I honestly don't remember writing that email, but it was 5 months
> and many thousands of emails ago :)
>
> Also, you all didn't document the heck out of this.  So no, I really do
> not want to see this patch accepted as-is.
>
> > Provided that will be fixed, can we count on your ack for the
> > implementation of the solution you proposed? :-)
>
> Let's see the patch that actually implements what I suggested first :)

I'd propose introducing a new procfs file (/proc/led-triggers) and new
/sys/class/leds//current-trigger api.

Reading /proc/led-triggers file shows all available triggers.
This violates "one value per file", but it's a procfs file.

The /sys/class/leds//current-trigger is almost identical to
/sys/class/leds//trigger.  The only difference is that
'current-trigger' only shows the current trigger name.
This file follows the "one value per file" rule of sysfs.
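
For illustration only (this is not part of the proposal): with the existing
'trigger' file, userspace has to pick the bracketed entry out of the whole
listing itself, which is the parsing step a single-value 'current-trigger'
file would make unnecessary.  A minimal userspace sketch, assuming the
listing format shown in this thread ("none [timer] heartbeat"); the helper
name current_trigger() is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed entry of a trigger listing such as
 * "none [timer] heartbeat\n" into out (NUL-terminated).
 * Returns the name length, or -1 if there is no complete "[...]" entry
 * or the name does not fit into outsz bytes.
 */
static int current_trigger(const char *listing, char *out, size_t outsz)
{
	const char *lb, *rb;
	size_t len;

	assert(listing != NULL && out != NULL);

	lb = strchr(listing, '[');
	if (!lb)
		return -1;
	rb = strchr(lb + 1, ']');
	if (!rb)
		return -1;

	len = (size_t)(rb - lb - 1);
	if (len + 1 > outsz)
		return -1;

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return (int)len;
}
```

The caller still has to read the complete listing first, so this only works
while the listing is not truncated at PAGE_SIZE.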

[PATCH] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-08-29 Thread Akinobu Mita

Reading /sys/class/leds//trigger returns all available LED triggers.
However, the size of this file is limited to PAGE_SIZE because of the
limitation for sysfs attribute.

Enabling LED CPU trigger on systems with thousands of CPUs easily hits
PAGE_SIZE limit, and makes it impossible to see all available LED triggers
and which trigger is currently activated.

This converts /sys/class/leds//trigger to bin attribute and removes
the PAGE_SIZE limitation.

Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Signed-off-by: Akinobu Mita 
---
 drivers/leds/led-class.c|  8 ++---
 drivers/leds/led-triggers.c | 84 ++---
 drivers/leds/leds.h |  6 
 include/linux/leds.h|  5 ---
 4 files changed, 74 insertions(+), 29 deletions(-)
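
Note on the approach: the read path formats the listing twice, first as a
dry run (query = true) that only computes the length, then for real into a
buffer of exactly that size.  The same two-pass pattern can be sketched in
userspace C as below.  This is an illustration under assumptions, not the
kernel code: format_alloc() is a hypothetical name, and malloc() stands in
for kvmalloc().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format into a freshly allocated buffer sized by a dry-run pass. */
static char *format_alloc(const char *fmt, ...)
{
	va_list args;
	char *buf;
	int len;

	assert(fmt != NULL);

	/* Pass 1: with NULL/0, vsnprintf() only reports the needed length. */
	va_start(args, fmt);
	len = vsnprintf(NULL, 0, fmt, args);
	va_end(args);
	if (len < 0)
		return NULL;

	buf = malloc((size_t)len + 1);
	if (!buf)
		return NULL;

	/* Pass 2: format for real into the exactly sized buffer. */
	va_start(args, fmt);
	vsnprintf(buf, (size_t)len + 1, fmt, args);
	va_end(args);
	assert(strlen(buf) == (size_t)len);

	return buf;
}
```

Both passes must see the same data, which is why the patch keeps the locks
held across the query and the real formatting.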

diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
index 4793e77..8b5a1d1 100644
--- a/drivers/leds/led-class.c
+++ b/drivers/leds/led-class.c
@@ -73,13 +73,13 @@ static ssize_t max_brightness_show(struct device *dev,
 static DEVICE_ATTR_RO(max_brightness);
 
 #ifdef CONFIG_LEDS_TRIGGERS
-static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store);
-static struct attribute *led_trigger_attrs[] = {
-   &dev_attr_trigger.attr,
+static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
+static struct bin_attribute *led_trigger_bin_attrs[] = {
+   &bin_attr_trigger,
NULL,
 };
 static const struct attribute_group led_trigger_group = {
-   .attrs = led_trigger_attrs,
+   .bin_attrs = led_trigger_bin_attrs,
 };
 #endif
 
diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 8d11a5e..4788e00 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "leds.h"
 
 /*
@@ -26,9 +27,11 @@ LIST_HEAD(trigger_list);
 
  /* Used by LED Class */
 
-ssize_t led_trigger_store(struct device *dev, struct device_attribute *attr,
-   const char *buf, size_t count)
+ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t pos, size_t count)
 {
+   struct device *dev = kobj_to_dev(kobj);
struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
int ret = count;
@@ -64,39 +67,80 @@ ssize_t led_trigger_store(struct device *dev, struct 
device_attribute *attr,
mutex_unlock(&led_cdev->led_access);
return ret;
 }
-EXPORT_SYMBOL_GPL(led_trigger_store);
+EXPORT_SYMBOL_GPL(led_trigger_write);
 
-ssize_t led_trigger_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
+__printf(4, 5)
+static int led_trigger_snprintf(char *buf, size_t size, bool query,
+   const char *fmt, ...)
+{
+   va_list args;
+   int i;
+
+   va_start(args, fmt);
+   if (query)
+   i = vsnprintf(NULL, 0, fmt, args);
+   else
+   i = vscnprintf(buf, size, fmt, args);
+   va_end(args);
+
+   return i;
+}
+
+static int led_trigger_format(char *buf, size_t size, bool query,
+ struct led_classdev *led_cdev)
 {
-   struct led_classdev *led_cdev = dev_get_drvdata(dev);
struct led_trigger *trig;
int len = 0;
 
+   len += led_trigger_snprintf(buf + len, size - len, query, "%s",
+   led_cdev->trigger ? "none" : "[none]");
+
+   list_for_each_entry(trig, &trigger_list, next_trig) {
+   bool hit = led_cdev->trigger &&
+   !strcmp(led_cdev->trigger->name, trig->name);
+
+   len += led_trigger_snprintf(buf + len, size - len, query,
+   " %s%s%s", hit ? "[" : "",
+   trig->name, hit ? "]" : "");
+   }
+
+   len += led_trigger_snprintf(buf + len, size - len, query, "\n");
+
+   return len;
+}
+
+ssize_t led_trigger_read(struct file *filp, struct kobject *kobj,
+   struct bin_attribute *attr, char *buf,
+   loff_t pos, size_t count)
+{
+   struct device *dev = kobj_to_dev(kobj);
+   struct led_classdev *led_cdev = dev_get_drvdata(dev);
+   void *data;
+   int len;
+
down_read(&triggers_list_lock);
down_read(&led_cdev->trigger_lock);
 
-   if (!led_cdev->trigger)
-   len += scnprintf(buf+len, PAGE_SIZE - len, "[none] ");
+   len = led_trigger_format(NULL, 0, true, led_cdev);
+   data = kvmalloc(len + 1, GFP_KERNEL);
+   if (data)
+   len = led_trigger_format(data, len + 1, false, led_cdev);
else
-   len += scnprintf(buf+len, PAGE_SIZE - len, "none ");
+

[PATCH 2/2] devcoredump: fix typo in comment

2019-07-27 Thread Akinobu Mita

s/dev_coredumpmsg/dev_coredumpsg/

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
Reviewed-by: Chaitanya Kulkarni 
Signed-off-by: Akinobu Mita 
---
 drivers/base/devcoredump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 3c960a6..e42d0b5 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -314,7 +314,7 @@ void dev_coredumpm(struct device *dev, struct module *owner,
 EXPORT_SYMBOL_GPL(dev_coredumpm);
 
 /**
- * dev_coredumpmsg - create device coredump that uses scatterlist as data
+ * dev_coredumpsg - create device coredump that uses scatterlist as data
  * parameter
  * @dev: the struct device for the crashed device
  * @table: the dump data
-- 
2.7.4

[PATCH 1/2] devcoredump: use memory_read_from_buffer

2019-07-27 Thread Akinobu Mita

Use memory_read_from_buffer() to simplify devcd_readv().

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Johannes Berg 
Signed-off-by: Akinobu Mita 
---
 drivers/base/devcoredump.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)
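
Note: a rough userspace model of the clamped copy that replaces the
open-coded logic, assuming memory_read_from_buffer() behaves as in
fs/libfs.c (negative position is an error, a position at or past the end
reads zero bytes, otherwise the count is clamped to what is available).
The name read_from_buffer() and the use of off_t in place of loff_t are
assumptions made for this sketch only.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Userspace model of a bounds-clamped read; advances *ppos on success. */
static ssize_t read_from_buffer(void *to, size_t count, off_t *ppos,
				const void *from, size_t available)
{
	off_t pos;

	assert(to != NULL && ppos != NULL && from != NULL);

	pos = *ppos;
	if (pos < 0)
		return -EINVAL;
	if ((size_t)pos >= available)
		return 0;		/* reading at or past the end */
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;

	memcpy(to, (const char *)from + pos, count);
	*ppos = pos + (off_t)count;

	return (ssize_t)count;
}
```

One visible difference under that assumption: the removed code returned
-EINVAL for an offset beyond datalen, whereas this model returns 0 there.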

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index f1a3353..3c960a6 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -164,16 +164,7 @@ static struct class devcd_class = {
 static ssize_t devcd_readv(char *buffer, loff_t offset, size_t count,
   void *data, size_t datalen)
 {
-   if (offset > datalen)
-   return -EINVAL;
-
-   if (offset + count > datalen)
-   count = datalen - offset;
-
-   if (count)
-   memcpy(buffer, ((u8 *)data) + offset, count);
-
-   return count;
+   return memory_read_from_buffer(buffer, count, &offset, data, datalen);
 }
 
 static void devcd_freev(void *data)
-- 
2.7.4

[PATCH 0/2] devcoredump: cleanup and typo fix

2019-07-27 Thread Akinobu Mita

These two patches are a cleanup and a typo fix for the device coredump
subsystem.  They were originally part of the nvme device coredump series.
However, that series requires an overhaul because it makes the nvme-pci
driver complicated, so these two independent patches have been extracted
from the series.

Akinobu Mita (2):
  devcoredump: use memory_read_from_buffer
  devcoredump: fix typo in comment

 drivers/base/devcoredump.c | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
-- 
2.7.4

Re: [PATCH] regmap: select CONFIG_REGMAP while REGMAP_SCCB is set

2019-07-04 Thread Akinobu Mita

2019年7月4日(木) 18:36 YueHaibing :
>
> REGMAP_SCCB is selected by ov772x and ov9650 drivers,
> but CONFIG_REGMAP may not be, so the build fails:
>
> drivers/media/i2c/ov772x.c: In function ov772x_probe:
> drivers/media/i2c/ov772x.c:1360:22: error: variable ov772x_regmap_config has 
> initializer but incomplete type
>   static const struct regmap_config ov772x_regmap_config = {
>   ^
> drivers/media/i2c/ov772x.c:1361:4: error: const struct regmap_config has no 
> member named reg_bits
>
> Reported-by: Hulk Robot 
> Fixes: 5bbf32217bf9 ("media: ov772x: use SCCB regmap")
> Signed-off-by: YueHaibing 
> ---
>  drivers/base/regmap/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/base/regmap/Kconfig b/drivers/base/regmap/Kconfig
> index c8bbf53..a498413 100644
> --- a/drivers/base/regmap/Kconfig
> +++ b/drivers/base/regmap/Kconfig
> @@ -4,7 +4,7 @@
>  # subsystems should select the appropriate symbols.
>
>  config REGMAP
> -   default y if (REGMAP_I2C || REGMAP_SPI || REGMAP_SPMI || REGMAP_W1 || 
> REGMAP_AC97 || REGMAP_MMIO || REGMAP_IRQ || REGMAP_I3C)
> +   default y if (REGMAP_I2C || REGMAP_SPI || REGMAP_SPMI || REGMAP_W1 || 
> REGMAP_AC97 || REGMAP_MMIO || REGMAP_IRQ || REGMAP_SCCB || REGMAP_I3C)
> select IRQ_DOMAIN if REGMAP_IRQ
> bool

Looks good.

Reviewed-by: Akinobu Mita 

A similar problem exists for REGMAP_SOUNDWIRE. But I can't find any users
of regmap_init_sdw (i.e. REGMAP_SOUNDWIRE).

Re: [PATCH] fault-inject: clean up debugfs file creation logic

2019-06-13 Thread Akinobu Mita

2019年6月12日(水) 18:58 Greg Kroah-Hartman :
>
> There is no need to check the return value of a debugfs_create_file
> call, a caller should never change what they do depending on if debugfs
> is working properly or not, so remove the checks, simplifying the logic
> in the file a lot.
>
> Also fix up the error check for debugfs_create_dir() which was not
> returning NULL for an error, but rather an error pointer.
>
> Cc: Akinobu Mita 
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Greg Kroah-Hartman 

Looks good.

Reviewed-by: Akinobu Mita
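
For readers unfamiliar with the "error pointer" convention mentioned in the
commit message: a failing call returns a negative errno encoded in the
pointer value, so a plain NULL check never fires.  Below is a small
userspace model of that encoding, assuming the usual convention that the
top 4095 address values are reserved for errors.  The lower-case names
(err_ptr, ptr_err, is_err, dir_status) are stand-ins for this sketch and
are not the kernel macros themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Decode the errno value again. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer value lies in the reserved error range. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Correct check: 0 for a usable pointer, negative errno otherwise. */
static long dir_status(const void *dir)
{
	return is_err(dir) ? ptr_err(dir) : 0;
}
```

In this model, a check of the form "if (!dir)" treats err_ptr(-ENOMEM) as
success, which is the kind of mistake the patch description refers to.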

Re: [PATCH] mm/failslab: By default, do not fail allocations with direct reclaim only

2019-05-20 Thread Akinobu Mita

2019年5月20日(月) 13:49 Nicolas Boichat :
>
> When failslab was originally written, the intention of the
> "ignore-gfp-wait" flag default value ("N") was to fail
> GFP_ATOMIC allocations. Those were defined as (__GFP_HIGH),
> and the code would test for __GFP_WAIT (0x10u).
>
> However, since then, __GFP_WAIT was replaced by __GFP_RECLAIM
> (___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM), and GFP_ATOMIC is
> now defined as (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM).
>
> This means that when the flag is false, almost no allocation
> ever fails (as even GFP_ATOMIC allocations contain
> __GFP_KSWAPD_RECLAIM).
>
> Restore the original intent of the code, by ignoring calls
> that directly reclaim only (___GFP_DIRECT_RECLAIM), and thus,
> failing GFP_ATOMIC calls again by default.
>
> Fixes: 71baba4b92dc1fa1 ("mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM")
> Signed-off-by: Nicolas Boichat 

Good catch.

Reviewed-by: Akinobu Mita 

> ---
>  mm/failslab.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/failslab.c b/mm/failslab.c
> index ec5aad211c5be97..33efcb60e633c0a 100644
> --- a/mm/failslab.c
> +++ b/mm/failslab.c
> @@ -23,7 +23,8 @@ bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags)
> if (gfpflags & __GFP_NOFAIL)
> return false;
>
> -   if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_RECLAIM))
> +   if (failslab.ignore_gfp_reclaim &&
> +   (gfpflags & ___GFP_DIRECT_RECLAIM))
> return false;

Should we use __GFP_DIRECT_RECLAIM instead of ___GFP_DIRECT_RECLAIM?
Because I found the following comment in gfp.h

/* Plain integer GFP bitmasks. Do not use this directly. */
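
To make the reasoning in the commit message concrete, here is a small
userspace sketch of the two checks.  The bit values are illustrative
placeholders, not the real ones from gfp.h, and the X_ prefixed names and
both helper functions are hypothetical; only the relationships between the
masks follow the description above (GFP_KERNEL also carries I/O and FS
bits, which are omitted here).

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bits for illustration only. */
#define X_GFP_HIGH		0x01u
#define X_GFP_ATOMIC		0x02u
#define X_GFP_DIRECT_RECLAIM	0x04u
#define X_GFP_KSWAPD_RECLAIM	0x08u

#define X_GFP_RECLAIM	(X_GFP_DIRECT_RECLAIM | X_GFP_KSWAPD_RECLAIM)

/* Composite masks as described in the commit message. */
#define X_ALLOC_ATOMIC	(X_GFP_HIGH | X_GFP_ATOMIC | X_GFP_KSWAPD_RECLAIM)
#define X_ALLOC_KERNEL	(X_GFP_RECLAIM)

/* Old check: any reclaim bit exempts the allocation from injection. */
static bool exempt_old(unsigned int flags)
{
	return (flags & X_GFP_RECLAIM) != 0;
}

/* New check: only allocations that may reclaim directly are exempt. */
static bool exempt_new(unsigned int flags)
{
	return (flags & X_GFP_DIRECT_RECLAIM) != 0;
}
```

With these masks, exempt_old() is true for both kinds of allocation, so
nothing is ever failed, while exempt_new() is false for the atomic case and
true for the kernel case.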

[PATCH v4 0/7] nvme-pci: support device coredump

2019-05-19 Thread Akinobu Mita

This series enables collecting a snapshot of controller information via the
device coredump mechanism.  The nvme device coredump is triggered when a
command timeout occurs, and can also be triggered by writing to a sysfs
attribute.

After finishing the nvme device coredump, the following files are created.

 - regs: NVMe controller registers (00h to 4Fh)
 - sq: Submission queue
 - cq: Completion queue
 - telemetry-ctrl-log: Telemetry controller-initiated log (if available)
 - data: Empty

The device coredump mechanism currently allows drivers to create only a
single coredump file, so this also provides a new function that allows
drivers to create several device coredump files in one crashed device.

* v4
- Add Reviewed-by tags
- Add nvme_get_telemetry_log() to nvme core module.
- Copy struct nvme_telemetry_log_page_hdr from the latest nvme-cli
- Use bio_vec instead of sg_table to store telemetry log page
- Make nvme_coredump_logs() return error if the device didn't produce
  a response.
- Abandon the reset if nvme_coredump_logs() returns error code

* v3
- Merge 'add telemetry log page definitions' patch and 'add facility to
  check log page attributes' patch
- Copy struct nvme_telemetry_log_page_hdr from the latest nvme-cli
- Add BUILD_BUG_ON for the size of struct nvme_telemetry_log_page_hdr
- Fix typo s/machanism/mechanism/ in commit log
- Fix max transfer size calculation for get log page
- Add function comments
- Extract 'enable to trigger device coredump by hand' patch
- Don't try to get telemetry log when admin queue is not available
- Avoid deadlock in .coredump callback

* v2
- Add Reviewed-by tag.
- Add patch to fix typo in comment
- Remove unneeded braces.
- Allocate device_entry followed by an array of devcd_file elements.
- Add telemetry log page definitions
- Add facility to check log page attributes
- Exclude the doorbell registers from register dump.
- Save controller registers in a binary format instead of a text format.
- Create an empty 'data' file in the device coredump.
- Save telemetry controller-initiated log if available
- Make coredump procedure into two phases (before resetting controller and
  after resetting as soon as admin queue is available).

Akinobu Mita (7):
  devcoredump: use memory_read_from_buffer
  devcoredump: fix typo in comment
  devcoredump: allow to create several coredump files in one device
  nvme: add basic facilities to get telemetry log page
  nvme-pci: add device coredump infrastructure
  nvme-pci: trigger device coredump on command timeout
  nvme-pci: enable to trigger device coredump by hand

 drivers/base/devcoredump.c  | 168 ++--
 drivers/nvme/host/Kconfig   |   1 +
 drivers/nvme/host/core.c|  59 ++
 drivers/nvme/host/nvme.h|   3 +
 drivers/nvme/host/pci.c | 473 ++--
 include/linux/devcoredump.h |  33 
 include/linux/nvme.h|  32 +++
 7 files changed, 696 insertions(+), 73 deletions(-)

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
-- 
2.7.4

[PATCH v4 2/7] devcoredump: fix typo in comment

2019-05-19 Thread Akinobu Mita

s/dev_coredumpmsg/dev_coredumpsg/

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
Reviewed-by: Chaitanya Kulkarni 
Signed-off-by: Akinobu Mita 
---
* v4
- Add Reviewed-by tag

 drivers/base/devcoredump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 3c960a6..e42d0b5 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -314,7 +314,7 @@ void dev_coredumpm(struct device *dev, struct module *owner,
 EXPORT_SYMBOL_GPL(dev_coredumpm);
 
 /**
- * dev_coredumpmsg - create device coredump that uses scatterlist as data
+ * dev_coredumpsg - create device coredump that uses scatterlist as data
  * parameter
  * @dev: the struct device for the crashed device
  * @table: the dump data
-- 
2.7.4

[PATCH v4 3/7] devcoredump: allow to create several coredump files in one device

2019-05-19 Thread Akinobu Mita

The device coredump mechanism currently allows drivers to create only a
single coredump file.  If there are several binary blobs to dump, we need
to define a binary format or convert to a text format in order to put them
into a single coredump file.

This provides a new function that allows drivers to create several device
coredump files in one crashed device.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
Signed-off-by: Akinobu Mita 
---
* v4
- No change since v2

 drivers/base/devcoredump.c  | 155 ++--
 include/linux/devcoredump.h |  33 ++
 2 files changed, 139 insertions(+), 49 deletions(-)
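
Note on the layout: the entry is allocated together with a trailing array
of per-file slots (a flexible array member), so one allocation covers the
header and all files.  A userspace sketch of that allocation pattern
follows.  The struct and function names are hypothetical, calloc() stands
in for kzalloc(), and the explicit overflow check plays the role that
struct_size() has in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct file_slot {
	const char *name;
	size_t size;
};

struct entry {
	int num_files;
	struct file_slot files[];	/* flexible array member */
};

/* Allocate a zeroed entry with room for num_files trailing slots. */
static struct entry *entry_alloc(int num_files)
{
	struct entry *e;
	size_t n;

	if (num_files < 0)
		return NULL;
	n = (size_t)num_files;

	/* Refuse counts whose total byte size would overflow size_t. */
	if (n > (SIZE_MAX - sizeof(*e)) / sizeof(e->files[0]))
		return NULL;

	e = calloc(1, sizeof(*e) + n * sizeof(e->files[0]));
	if (!e)
		return NULL;

	e->num_files = num_files;
	return e;
}
```

A single free() releases the header and every slot, which mirrors how the
entry is released as one object in the release callback.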

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index e42d0b5..4dd6dba 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -25,16 +25,20 @@ static bool devcd_disabled;
 /* if data isn't read by userspace after 5 minutes then delete it */
 #define DEVCD_TIMEOUT  (HZ * 60 * 5)
 
-struct devcd_entry {
-   struct device devcd_dev;
-   void *data;
-   size_t datalen;
-   struct module *owner;
+struct devcd_file {
+   struct bin_attribute bin_attr;
ssize_t (*read)(char *buffer, loff_t offset, size_t count,
void *data, size_t datalen);
void (*free)(void *data);
+};
+
+struct devcd_entry {
+   struct device devcd_dev;
+   struct module *owner;
struct delayed_work del_wk;
struct device *failing_dev;
+   int num_files;
+   struct devcd_file files[];
 };
 
 static struct devcd_entry *dev_to_devcd(struct device *dev)
@@ -45,8 +49,14 @@ static struct devcd_entry *dev_to_devcd(struct device *dev)
 static void devcd_dev_release(struct device *dev)
 {
struct devcd_entry *devcd = dev_to_devcd(dev);
+   int i;
+
+   for (i = 0; i < devcd->num_files; i++) {
+   struct devcd_file *file = &devcd->files[i];
+
+   file->free(file->bin_attr.private);
+   }
 
-   devcd->free(devcd->data);
module_put(devcd->owner);
 
/*
@@ -64,9 +74,14 @@ static void devcd_dev_release(struct device *dev)
 static void devcd_del(struct work_struct *wk)
 {
struct devcd_entry *devcd;
+   int i;
 
devcd = container_of(wk, struct devcd_entry, del_wk.work);
 
+   for (i = 0; i < devcd->num_files; i++)
+   device_remove_bin_file(&devcd->devcd_dev,
+  &devcd->files[i].bin_attr);
+
device_del(&devcd->devcd_dev);
put_device(&devcd->devcd_dev);
 }
@@ -75,10 +90,11 @@ static ssize_t devcd_data_read(struct file *filp, struct 
kobject *kobj,
   struct bin_attribute *bin_attr,
   char *buffer, loff_t offset, size_t count)
 {
-   struct device *dev = kobj_to_dev(kobj);
-   struct devcd_entry *devcd = dev_to_devcd(dev);
+   struct devcd_file *file =
+   container_of(bin_attr, struct devcd_file, bin_attr);
 
-   return devcd->read(buffer, offset, count, devcd->data, devcd->datalen);
+   return file->read(buffer, offset, count, bin_attr->private,
+ bin_attr->size);
 }
 
 static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj,
@@ -93,25 +109,6 @@ static ssize_t devcd_data_write(struct file *filp, struct 
kobject *kobj,
return count;
 }
 
-static struct bin_attribute devcd_attr_data = {
-   .attr = { .name = "data", .mode = S_IRUSR | S_IWUSR, },
-   .size = 0,
-   .read = devcd_data_read,
-   .write = devcd_data_write,
-};
-
-static struct bin_attribute *devcd_dev_bin_attrs[] = {
-   &devcd_attr_data, NULL,
-};
-
-static const struct attribute_group devcd_dev_group = {
-   .bin_attrs = devcd_dev_bin_attrs,
-};
-
-static const struct attribute_group *devcd_dev_groups[] = {
-   &devcd_dev_group, NULL,
-};
-
 static int devcd_free(struct device *dev, void *data)
 {
struct devcd_entry *devcd = dev_to_devcd(dev);
@@ -157,7 +154,6 @@ static struct class devcd_class = {
.name   = "devcoredump",
.owner  = THIS_MODULE,
.dev_release= devcd_dev_release,
-   .dev_groups = devcd_dev_groups,
.class_groups   = devcd_class_groups,
 };
 
@@ -234,30 +230,55 @@ static ssize_t devcd_read_from_sgtable(char *buffer, 
loff_t offset,
  offset);
 }
 
+static struct devcd_entry *devcd_alloc(struct dev_coredumpm_bulk_data *files,
+  int num_files, gfp_t gfp)
+{
+   struct devcd_entry *devcd;
+   int i;
+
+   devcd = kzalloc(struct_size(devcd, files, num_files), gfp);
+   if (!devcd)
+   return NULL;
+
+   devcd->num_files = num_files;
+
+   for (i = 0; i < devcd->num_files; i++) {
+

[PATCH v4 5/7] nvme-pci: add device coredump infrastructure

2019-05-19 Thread Akinobu Mita

This provides three functions to implement the device coredump for nvme
driver.

nvme_coredump_init() - This function is called when the driver decides
to start collecting a device coredump.  It captures snapshots of the
controller registers and of the admin and IO queues.

nvme_coredump_logs() - This function is called as soon as the device is
recovered from the crash and admin queue becomes available.  If the device
coredump has already been started by nvme_coredump_init(), the telemetry
controller-initiated data will be collected.  Otherwise do nothing.

nvme_coredump_complete() - This function is called when the driver
determines that there is nothing more to collect for the device coredump.
All collected coredumps are exported via the device coredump mechanism.

After finishing the nvme device coredump, the following files are created.

- regs: NVMe controller registers (00h to 4Fh)
- sq: Submission queue
- cq: Completion queue
- telemetry-ctrl-log: Telemetry controller-initiated log (if available)
- data: Empty

The empty 'data' file is there to provide a uniform way to signal that the
device coredump is no longer needed, namely by writing to the 'data' file.

All existing drivers using the device coredump provide a 'data' file.  If
the nvme device coredump did not provide one, userspace programs would need
to know which driver provides which coredump files.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
Signed-off-by: Akinobu Mita 
---
* v4
- Move nvme_get_telemetry_log() to nvme core module.
- Use bio_vec instead of sg_table to store telemetry log page
- Make nvme_coredump_logs() return error if the device didn't produce
  a response.

 drivers/nvme/host/Kconfig |   1 +
 drivers/nvme/host/pci.c   | 425 ++
 2 files changed, 426 insertions(+)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 0f345e2..c3a06af 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -5,6 +5,7 @@ config BLK_DEV_NVME
tristate "NVM Express block device"
depends on PCI && BLOCK
select NVME_CORE
+   select WANT_DEV_COREDUMP
---help---
  The NVM Express driver is for solid state drives directly
  connected to the PCI or PCI Express bus.  If you know you
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2a8708c..8a29c52 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -89,6 +90,10 @@ struct nvme_queue;
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
 
+static void __maybe_unused nvme_coredump_init(struct nvme_dev *dev);
+static int __maybe_unused nvme_coredump_logs(struct nvme_dev *dev);
+static void __maybe_unused nvme_coredump_complete(struct nvme_dev *dev);
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
@@ -131,6 +136,9 @@ struct nvme_dev {
dma_addr_t host_mem_descs_dma;
struct nvme_host_mem_buf_desc *host_mem_descs;
void **host_mem_desc_bufs;
+
+   struct dev_coredumpm_bulk_data *dumps;
+   int num_dumps;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -2849,6 +2857,423 @@ static int nvme_resume(struct device *dev)
 
 static SIMPLE_DEV_PM_OPS(nvme_dev_pm_ops, nvme_suspend, nvme_resume);
 
+#ifdef CONFIG_DEV_COREDUMP
+
+static ssize_t nvme_coredump_read(char *buffer, loff_t offset, size_t count,
+ void *data, size_t datalen)
+{
+   return memory_read_from_buffer(buffer, count, &offset, data, datalen);
+}
+
+static void nvme_coredump_free(void *data)
+{
+   kvfree(data);
+}
+
+static int nvme_coredump_empty(struct dev_coredumpm_bulk_data *data)
+{
+   data->name = kstrdup("data", GFP_KERNEL);
+   if (!data->name)
+   return -ENOMEM;
+
+   data->data = NULL;
+   data->datalen = 0;
+   data->read = nvme_coredump_read;
+   data->free = nvme_coredump_free;
+
+   return 0;
+}
+
+static int nvme_coredump_regs(struct dev_coredumpm_bulk_data *data,
+ struct nvme_ctrl *ctrl)
+{
+   const int reg_size = 0x50; /* 00h to 4Fh */
+
+   data->name = kstrdup("regs", GFP_KERNEL);
+   if (!data->name)
+   return -ENOMEM;
+
+   data->data = kvzalloc(reg_size, GFP_KERNEL);
+   if (!data->data) {
+   kfree(data->name);
+   return -ENOMEM;
+   }
+   memcpy_fromio(data->data, to_nvme_dev(ctrl)->bar, reg_size);
+
+   data->datalen = reg_size;
+   data->read = nvme_coredump_read;
+   data->free = nvme_coredump_free;

[PATCH v4 6/7] nvme-pci: trigger device coredump on command timeout

2019-05-19 Thread Akinobu Mita

This enables the nvme driver to trigger a device coredump when a command
timeout occurs, which helps diagnose and debug issues.

This can be tested with fail_io_timeout fault injection.

# echo 1 > /sys/kernel/debug/fail_io_timeout/probability
# echo 1 > /sys/kernel/debug/fail_io_timeout/times
# echo 1 > /sys/block/nvme0n1/io-timeout-fail
# dd if=/dev/nvme0n1 of=/dev/null

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
Signed-off-by: Akinobu Mita 
---
* v4
- Abandon the reset if nvme_coredump_logs() returns error code

 drivers/nvme/host/pci.c | 41 +
 1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8a29c52..6436e72 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -87,12 +87,12 @@ MODULE_PARM_DESC(poll_queues, "Number of queues to use for 
polled IO.");
 struct nvme_dev;
 struct nvme_queue;
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool dump);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
 
-static void __maybe_unused nvme_coredump_init(struct nvme_dev *dev);
-static int __maybe_unused nvme_coredump_logs(struct nvme_dev *dev);
-static void __maybe_unused nvme_coredump_complete(struct nvme_dev *dev);
+static void nvme_coredump_init(struct nvme_dev *dev);
+static int nvme_coredump_logs(struct nvme_dev *dev);
+static void nvme_coredump_complete(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -1280,7 +1280,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
 */
if (nvme_should_reset(dev, csts)) {
nvme_warn_reset(dev, csts);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
return BLK_EH_DONE;
}
@@ -1310,7 +1310,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
dev_warn_ratelimited(dev->ctrl.device,
 "I/O %d QID %d timeout, disable controller\n",
 req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, shutdown);
+   nvme_dev_disable(dev, shutdown, true);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_DONE;
default:
@@ -1326,7 +1326,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
dev_warn(dev->ctrl.device,
 "I/O %d QID %d timeout, reset controller\n",
 req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
 
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
@@ -2382,7 +2382,7 @@ static void nvme_pci_disable(struct nvme_dev *dev)
}
 }
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool dump)
 {
bool dead = true;
struct pci_dev *pdev = to_pci_dev(dev->dev);
@@ -2407,6 +2407,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
}
 
+   if (dump)
+   nvme_coredump_init(dev);
+
nvme_stop_queues(&dev->ctrl);
 
if (!dead && dev->ctrl.queue_count > 0) {
@@ -2477,7 +2480,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, 
int status)
dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", 
status);
 
nvme_get_ctrl(&dev->ctrl);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
nvme_kill_queues(&dev->ctrl);
if (!queue_work(nvme_wq, &dev->remove_work))
nvme_put_ctrl(&dev->ctrl);
@@ -2499,7 +2502,7 @@ static void nvme_reset_work(struct work_struct *work)
 * moving on.
 */
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
 
mutex_lock(&dev->shutdown_lock);
result = nvme_pci_enable(dev);
@@ -2536,6 +2539,11 @@ static void nvme_reset_work(struct work_struct *work)
if (result)
goto out;
 
+   result = nvme_coredump_logs(dev);
+   if (result)
+   goto out;
+   nvme_coredump_complete(dev);
+
if (dev->ctrl.oacs & NVME_CTRL_OACS_SEC_SUPP) {
if (!dev->ctrl.opal_dev)

[PATCH v4 7/7] nvme-pci: enable to trigger device coredump by hand

2019-05-19 Thread Akinobu Mita

This provides a way to trigger the nvme device coredump by writing
anything to the /sys/devices/.../coredump attribute.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
Signed-off-by: Akinobu Mita 
---
* v4
- No change since v3

 drivers/nvme/host/pci.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6436e72..04084b9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3266,6 +3266,14 @@ static void nvme_coredump_complete(struct nvme_dev *dev)
nvme_coredump_clear(dev);
 }
 
+static void nvme_coredump(struct device *dev)
+{
+   struct nvme_dev *ndev = dev_get_drvdata(dev);
+
+   nvme_dev_disable(ndev, false, true);
+   nvme_reset_ctrl_sync(&ndev->ctrl);
+}
+
 #else
 
 static void nvme_coredump_init(struct nvme_dev *dev)
@@ -3281,6 +3289,10 @@ static void nvme_coredump_complete(struct nvme_dev *dev)
 {
 }
 
+static void nvme_coredump(struct device *dev)
+{
+}
+
 #endif /* CONFIG_DEV_COREDUMP */
 
 static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
@@ -3388,6 +3400,7 @@ static struct pci_driver nvme_driver = {
.shutdown   = nvme_shutdown,
.driver = {
.pm = &nvme_dev_pm_ops,
+   .coredump = nvme_coredump,
},
.sriov_configure = pci_sriov_configure_simple,
.err_handler= &nvme_err_handler,
-- 
2.7.4

[PATCH v4 1/7] devcoredump: use memory_read_from_buffer

2019-05-19 Thread Akinobu Mita

Use memory_read_from_buffer() to simplify devcd_readv().

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Johannes Berg 
Signed-off-by: Akinobu Mita 
---
* v4
- Add Reviewed-by tag

 drivers/base/devcoredump.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index f1a3353..3c960a6 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -164,16 +164,7 @@ static struct class devcd_class = {
 static ssize_t devcd_readv(char *buffer, loff_t offset, size_t count,
   void *data, size_t datalen)
 {
-   if (offset > datalen)
-   return -EINVAL;
-
-   if (offset + count > datalen)
-   count = datalen - offset;
-
-   if (count)
-   memcpy(buffer, ((u8 *)data) + offset, count);
-
-   return count;
+   return memory_read_from_buffer(buffer, count, &offset, data, datalen);
 }
 
 static void devcd_freev(void *data)
-- 
2.7.4
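
For readers comparing the two versions of devcd_readv(), here is a minimal
user-space sketch of the removed open-coded logic next to a model of a
memory_read_from_buffer()-style helper.  The names old_readv() and
model_read_from_buffer() are illustrative only and are not kernel symbols;
the model assumes the helper returns 0 (end of file) when the position is at
or past the end of the data, whereas the removed code returned -EINVAL for an
offset beyond the data.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t, off_t */

/* Behaviour of the open-coded devcd_readv() body removed by the patch. */
static ssize_t old_readv(char *buffer, off_t offset, size_t count,
			 const void *data, size_t datalen)
{
	if ((size_t)offset > datalen)
		return -EINVAL;
	if ((size_t)offset + count > datalen)
		count = datalen - (size_t)offset;
	if (count)
		memcpy(buffer, (const unsigned char *)data + offset, count);
	return (ssize_t)count;
}

/* Illustrative user-space model of a memory_read_from_buffer()-style helper. */
static ssize_t model_read_from_buffer(void *to, size_t count, off_t *ppos,
				      const void *from, size_t available)
{
	off_t pos = *ppos;

	if (pos < 0)
		return -EINVAL;
	if ((size_t)pos >= available)
		return 0;	/* at or past the end: report EOF, not an error */
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;
	memcpy(to, (const unsigned char *)from + pos, count);
	*ppos = pos + (off_t)count;	/* the helper also advances the position */
	return (ssize_t)count;
}
```

Both functions clamp a read that runs past the end of the buffer; under the
assumptions above, the difference is limited to reads that start beyond the
end of the data.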

[PATCH v4 4/7] nvme: add basic facilities to get telemetry log page

2019-05-19 Thread Akinobu Mita

This adds the required facilities to get the telemetry log page.  The
telemetry log page structure and identifier are copied from nvme-cli.

We need a facility to check the log page attributes in order to know whether
the controller supports the telemetry log pages and the log page offset field
for the Get Log Page command.  The telemetry data area could be larger than
the maximum data transfer size, so we may need to split it into multiple
transfers with an incrementing log page offset.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Cc: Chaitanya Kulkarni 
Signed-off-by: Akinobu Mita 
---
* v4
- Add nvme_get_telemetry_log() to nvme core module.
- Copy struct nvme_telemetry_log_page_hdr from the latest nvme-cli

 drivers/nvme/host/core.c | 59 
 drivers/nvme/host/nvme.h |  3 +++
 include/linux/nvme.h | 32 ++
 3 files changed, 94 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7da80f3..d352145 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2464,6 +2464,7 @@ int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 
log_page, u8 lsp,
 
return nvme_submit_sync_cmd(ctrl->admin_q, &c, log, size);
 }
+EXPORT_SYMBOL_GPL(nvme_get_log);
 
 static int nvme_get_effects_log(struct nvme_ctrl *ctrl)
 {
@@ -2484,6 +2485,62 @@ static int nvme_get_effects_log(struct nvme_ctrl *ctrl)
return ret;
 }
 
+static int nvme_get_log_blocks(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page,
+  u8 lsp, void *buf, size_t bytes, loff_t offset)
+{
+   loff_t pos = 0;
+   u32 chunk_size;
+
+   if (check_mul_overflow(ctrl->max_hw_sectors, 512u, &chunk_size))
+   chunk_size = UINT_MAX;
+
+   while (pos < bytes) {
+   size_t size = min_t(size_t, bytes - pos, chunk_size);
+   int ret;
+
+   if ((offset + pos) &&
+   !(ctrl->lpa & NVME_CTRL_LPA_EXTENDED_DATA))
+   return -EINVAL;
+
+   ret = nvme_get_log(ctrl, nsid, log_page, lsp, buf + pos, size,
+  offset + pos);
+   if (ret)
+   return ret;
+
+   pos += size;
+   }
+
+   return 0;
+}
+
+int nvme_get_telemetry_log(struct nvme_ctrl *ctrl, struct bio_vec *bvecs,
+  size_t bytes)
+{
+   struct bvec_iter iter = {
+   .bi_size = bytes,
+   };
+   size_t offset = 0;
+
+   while (iter.bi_size) {
+   struct bio_vec bvec = mp_bvec_iter_bvec(bvecs, iter);
+   size_t size = min(iter.bi_size, bvec.bv_len);
+   void *buf = page_address(bvec.bv_page) + bvec.bv_offset;
+   int ret;
+
+   ret = nvme_get_log_blocks(ctrl, NVME_NSID_ALL,
+ NVME_LOG_TELEMETRY_CTRL, 0, buf, size,
+ offset);
+   if (ret)
+   return ret;
+
+   offset += size;
+   bvec_iter_advance(bvecs, &iter, size);
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_get_telemetry_log);
+
 /*
  * Initialize the cached copies of the Identify data and various controller
  * register in our nvme_ctrl structure.  This should be called as soon as
@@ -2587,6 +2644,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
} else
ctrl->shutdown_timeout = shutdown_timeout;
 
+   ctrl->lpa = id->lpa;
ctrl->npss = id->npss;
ctrl->apsta = id->apsta;
prev_apst_enabled = ctrl->apst_enabled;
@@ -3899,6 +3957,7 @@ static inline void _nvme_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != NVME_IDENTIFY_DATA_SIZE);
BUILD_BUG_ON(sizeof(struct nvme_id_ns) != NVME_IDENTIFY_DATA_SIZE);
BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
+   BUILD_BUG_ON(sizeof(struct nvme_telemetry_log_page_hdr) != 512);
BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5ee75b5..56bba7a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -195,6 +195,7 @@ struct nvme_ctrl {
u32 vs;
u32 sgls;
u16 kas;
+   u8 lpa;
u8 npss;
u8 apsta;
u32 oaes;
@@ -466,6 +467,8 @@ int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
 
 int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,
void *log, size_t size, u64 offset);
+int nvme_get_telemetry_log(struct nvme_ctrl *ctrl, struct bio_vec *bvecs,
+  size_t bytes);
 
 extern const struct attribute_group *nvme_ns_id_attr_groups[];
 extern const struct bloc

Re: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

2019-05-14 Thread Akinobu Mita

On Tue, May 14, 2019 at 0:23, Chaitanya Kulkarni wrote:
>
> On 05/13/2019 12:46 AM, Minwoo Im wrote:
> >> +static int nvme_get_telemetry_log_blocks(struct nvme_ctrl *ctrl, void 
> >> *buf,
> >> + size_t bytes, loff_t offset)
> >> +{
> >> +loff_t pos = 0;
> >> +u32 chunk_size;
> >> +
> >> +if (check_mul_overflow(ctrl->max_hw_sectors, 512u, &chunk_size))
> >> +chunk_size = UINT_MAX;
> >> +
> >> +while (pos < bytes) {
> >> +size_t size = min_t(size_t, bytes - pos, chunk_size);
> >> +int ret;
> >> +
> >> +ret = nvme_get_log(ctrl, NVME_NSID_ALL,
> >> NVME_LOG_TELEMETRY_CTRL,
> >> +   0, buf + pos, size, offset + pos);
> >> +if (ret)
> >> +return ret;
> >> +
> >> +pos += size;
> >> +}
> >> +
> >> +return 0;
> >> +}
> >> +
> >> +static int nvme_get_telemetry_log(struct nvme_ctrl *ctrl,
> >> +  struct sg_table *table, size_t bytes)
> >> +{
> >> +int n = sg_nents(table->sgl);
> >> +struct scatterlist *sg;
> >> +size_t offset = 0;
> >> +int i;
> >> +
> A little comment would be nice if you are using sg operations.
> >> +for_each_sg(table->sgl, sg, n, i) {
> >> +struct page *page = sg_page(sg);
> >> +size_t size = min_t(int, bytes - offset, sg->length);
> >> +int ret;
> >> +
> >> +ret = nvme_get_telemetry_log_blocks(ctrl,
> >> page_address(page),
> >> +size, offset);
> >> +if (ret)
> >> +return ret;
> >> +
> >> +offset += size;
> >> +}
> >> +
> >> +return 0;
> >> +}
> >
> > Can we have those two in nvme-core module instead of being in pci module?
>
> Since they are based on the controller they should be moved next to
> nvme_get_log() in the ${KERN_DIR}/drivers/nvme/host/core.c.

OK.  But these functions will be changed to use bio_vec instead of sg in
the next version.
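
The chunked transfer in the quoted nvme_get_telemetry_log_blocks() can be
modelled in user space as follows.  This is only a sketch: xfer_fn,
read_in_chunks(), struct fake_log and fake_xfer() are hypothetical names
standing in for nvme_get_log() and its caller, and the guard against a zero
chunk size is an addition of this sketch that the quoted code does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical transfer callback standing in for nvme_get_log(). */
typedef int (*xfer_fn)(void *buf, size_t size, uint64_t offset, void *ctx);

/*
 * Read `bytes` bytes starting at `offset`, never asking for more than
 * max_sectors * 512 bytes at once (saturating at UINT32_MAX on overflow,
 * as the quoted check_mul_overflow() fallback does).
 */
static int read_in_chunks(xfer_fn xfer, void *ctx, void *buf, size_t bytes,
			  uint64_t offset, uint32_t max_sectors)
{
	uint64_t wide = (uint64_t)max_sectors * 512u;
	uint32_t chunk_size = wide > UINT32_MAX ? UINT32_MAX : (uint32_t)wide;
	size_t pos = 0;

	if (chunk_size == 0)
		return -1;	/* sketch-only guard: avoid looping forever */

	while (pos < bytes) {
		size_t size = bytes - pos < chunk_size ? bytes - pos : chunk_size;
		int ret = xfer((unsigned char *)buf + pos, size, offset + pos, ctx);

		if (ret)
			return ret;
		pos += size;
	}
	return 0;
}

/* Demo backend: copies from an in-memory "log" and counts the calls. */
struct fake_log {
	const unsigned char *bytes;
	int calls;
};

static int fake_xfer(void *buf, size_t size, uint64_t offset, void *ctx)
{
	struct fake_log *log = ctx;

	memcpy(buf, log->bytes + offset, size);
	log->calls++;
	return 0;
}
```

With a limit of one 512-byte sector, a 1300-byte read is split into three
transfers of 512, 512 and 276 bytes.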

Re: [PATCH v3 4/7] nvme: add basic facility to get telemetry log page

2019-05-14 Thread Akinobu Mita

On Tue, May 14, 2019 at 0:34, Chaitanya Kulkarni wrote:
>
> On 05/12/2019 08:55 AM, Akinobu Mita wrote:
> > This adds the required definisions to get telemetry log page.
> s/definisions/definitions/

OK.

> > diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> > index c40720c..8c0b29d 100644
> > --- a/include/linux/nvme.h
> > +++ b/include/linux/nvme.h
> > @@ -294,6 +294,8 @@ enum {
> >   NVME_CTRL_OACS_DIRECTIVES   = 1 << 5,
> >   NVME_CTRL_OACS_DBBUF_SUPP   = 1 << 8,
> >   NVME_CTRL_LPA_CMD_EFFECTS_LOG   = 1 << 1,
> > + NVME_CTRL_LPA_EXTENDED_DATA = 1 << 2,
> > + NVME_CTRL_LPA_TELEMETRY_LOG = 1 << 3,
> >   };
> >
> >   struct nvme_lbaf {
> > @@ -396,6 +398,20 @@ enum {
> >   NVME_NIDT_UUID  = 0x03,
> >   };
> >
> > +struct nvme_telemetry_log_page_hdr {
> > + __u8lpi; /* Log page identifier */
> > + __u8rsvd[4];
> > + __u8iee_oui[3];
> > + __le16  dalb1; /* Data area 1 last block */
> > + __le16  dalb2; /* Data area 2 last block */
> > + __le16  dalb3; /* Data area 3 last block */
> > + __u8rsvd1[368];
> > + __u8ctrlavail; /* Controller initiated data avail?*/
> > + __u8ctrldgn; /* Controller initiated telemetry Data Gen # */
> > + __u8rsnident[128];
> > + __u8telemetry_dataarea[0];
> > +};
> > +
>
> nit:- Thanks for adding the comments, can you please align all the above
> comments like :-

OK.  I'll send a patch for nvme-cli first.

> +struct nvme_telemetry_log_page_hdr {
> +   __u8lpi;/* Log page identifier */
> +   __u8rsvd[4];
> +   __u8iee_oui[3];
> +   __le16  dalb1;  /* Data area 1 last block */
> +   __le16  dalb2;  /* Data area 2 last block */
> +   __le16  dalb3;  /* Data area 3 last block */
> +   __u8rsvd1[368];
> +   __u8ctrlavail;  /* Controller initiated data avail?*/
> +   __u8ctrldgn;/* Controller initiated telemetry Data
> Gen # */
> +   __u8rsnident[128];
> +   __u8telemetry_dataarea[0];
> +};
> +
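
As a side note on the 512-byte BUILD_BUG_ON that accompanies this structure,
the layout can be checked in user space with the sketch below.  It mirrors
the quoted fields using <stdint.h> types instead of the kernel's __u8/__le16
(so it does not model endianness), and uses a C99 flexible array member in
place of the zero-length array; the struct name is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the quoted header; field names follow the patch. */
struct telemetry_log_page_hdr {
	uint8_t  lpi;		/* Log page identifier */
	uint8_t  rsvd[4];
	uint8_t  iee_oui[3];
	uint16_t dalb1;		/* Data area 1 last block */
	uint16_t dalb2;		/* Data area 2 last block */
	uint16_t dalb3;		/* Data area 3 last block */
	uint8_t  rsvd1[368];
	uint8_t  ctrlavail;	/* Controller initiated data avail? */
	uint8_t  ctrldgn;	/* Controller initiated telemetry Data Gen # */
	uint8_t  rsnident[128];
	uint8_t  telemetry_dataarea[];	/* flexible array: adds no size */
};

/* 1 + 4 + 3 + 2 + 2 + 2 + 368 + 1 + 1 + 128 = 512 bytes, no padding. */
_Static_assert(sizeof(struct telemetry_log_page_hdr) == 512,
	       "telemetry header must be 512 bytes");
_Static_assert(offsetof(struct telemetry_log_page_hdr, dalb1) == 8,
	       "dalb1 must start at byte 8");
_Static_assert(offsetof(struct telemetry_log_page_hdr, rsnident) == 384,
	       "rsnident must start at byte 384");
```

The 16-bit fields start at even offsets, so the compiler inserts no padding
and no packing attribute is needed for this sketch.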

Re: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

2019-05-13 Thread Akinobu Mita

On Mon, May 13, 2019 at 23:03, Christoph Hellwig wrote:
>
> Usage of a scatterlist here is rather bogus as we never use
> it for dma mapping.  Why can't you store the various pages in a
> large bio_vec and then just issue that to the device in one
> get log page command?  (or at least a few if MDTS kicks in?)

OK.  I'll try to use bio_vec and see how it goes.

Re: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

2019-05-13 Thread Akinobu Mita

On Mon, May 13, 2019 at 22:55, Keith Busch wrote:
>
> On Sun, May 12, 2019 at 08:54:15AM -0700, Akinobu Mita wrote:
> > +static void nvme_coredump_logs(struct nvme_dev *dev)
> > +{
> > + struct dev_coredumpm_bulk_data *bulk_data;
> > +
> > + if (!dev->dumps)
> > + return;
> > +
> > + bulk_data = nvme_coredump_alloc(dev, 1);
> > + if (!bulk_data)
> > + return;
> > +
> > + if (nvme_coredump_telemetry_log(bulk_data, &dev->ctrl))
> > + dev->num_dumps--;
> > +}
>
> You'll need this function to return the same 'int' value from
> nvme_coredump_telemetry_log. A negative value here means that the
> device didn't produce a response, and that's important to check from
> the reset work since you'll need to abort the reset if that happens.

OK.  Makes sense.
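
A minimal user-space sketch of the control flow being asked for: the log
collection step returns an int, and the reset path stops as soon as that
value is non-zero.  struct fake_dev, coredump_logs() and reset_work() are
hypothetical stand-ins, not the driver's real types or functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state relevant to this discussion. */
struct fake_dev {
	bool has_dump;		/* a coredump was started before the reset */
	int log_result;		/* what the telemetry log fetch would return */
	bool completed;		/* set once the coredump has been exported */
};

/* Returns 0 on success or when there is nothing to do, else an error code. */
static int coredump_logs(struct fake_dev *dev)
{
	if (!dev->has_dump)
		return 0;	/* no coredump in progress: nothing to collect */
	return dev->log_result;
}

/* The reset path abandons the reset when log collection fails. */
static int reset_work(struct fake_dev *dev)
{
	int result = coredump_logs(dev);

	if (result)
		return result;	/* device did not respond: abort the reset */

	dev->completed = true;
	return 0;
}
```

The point of returning the error instead of swallowing it is that the caller
can tell a device that did not answer from one that has no log to offer.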

Re: [PATCH v3 6/7] nvme-pci: trigger device coredump on command timeout

2019-05-13 Thread Akinobu Mita

On Mon, May 13, 2019 at 16:41, Minwoo Im wrote:
>
> > -static void __maybe_unused nvme_coredump_init(struct nvme_dev *dev);
> > -static void __maybe_unused nvme_coredump_logs(struct nvme_dev *dev);
> > -static void __maybe_unused nvme_coredump_complete(struct nvme_dev
> > *dev);
> > +static void nvme_coredump_init(struct nvme_dev *dev);
> > +static void nvme_coredump_logs(struct nvme_dev *dev);
> > +static void nvme_coredump_complete(struct nvme_dev *dev);
>
> You just have added those three prototypes in previous patch.  Did I miss
> something here?

These __maybe_unused annotations are needed only in patch 5/7, because
these functions are still unused until patch 6/7 is applied.

[PATCH v3 3/7] devcoredump: allow to create several coredump files in one device

2019-05-12 Thread Akinobu Mita

The device coredump mechanism currently allows drivers to create only a
single coredump file.  If there are several binary blobs to dump, we need
to define a binary format or convert them to a text format in order to put
them into a single coredump file.

This provides a new function that allows drivers to create several device
coredump files in one crashed device.
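
As a rough illustration of the data layout this implies, the sketch below
allocates one entry with a trailing array of per-file descriptors in a single
allocation, which is what a flexible array member combined with a
struct_size()-style size computation achieves.  struct dump_file,
struct dump_entry and dump_entry_alloc() are hypothetical user-space names,
not the kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-file descriptor: one named blob per coredump file. */
struct dump_file {
	const char *name;
	const void *data;
	size_t datalen;
};

/* One entry per crashed device, followed by num_files descriptors. */
struct dump_entry {
	int num_files;
	struct dump_file files[];	/* flexible array member */
};

/* Allocate a zeroed entry with room for num_files descriptors. */
static struct dump_entry *dump_entry_alloc(int num_files)
{
	struct dump_entry *entry;

	/* Reject negative counts and sizes that would overflow size_t. */
	if (num_files < 0 ||
	    (size_t)num_files >
	    (SIZE_MAX - sizeof(*entry)) / sizeof(entry->files[0]))
		return NULL;

	entry = calloc(1, sizeof(*entry) +
			  (size_t)num_files * sizeof(entry->files[0]));
	if (!entry)
		return NULL;

	entry->num_files = num_files;
	return entry;
}
```

A single allocation keeps the descriptors alive exactly as long as the entry
itself, so releasing the entry needs only one free() after each file's data
has been released.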

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Signed-off-by: Akinobu Mita 
---
* v3
- No change since v2

 drivers/base/devcoredump.c  | 155 ++--
 include/linux/devcoredump.h |  33 ++
 2 files changed, 139 insertions(+), 49 deletions(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index e42d0b5..4dd6dba 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -25,16 +25,20 @@ static bool devcd_disabled;
 /* if data isn't read by userspace after 5 minutes then delete it */
 #define DEVCD_TIMEOUT  (HZ * 60 * 5)
 
-struct devcd_entry {
-   struct device devcd_dev;
-   void *data;
-   size_t datalen;
-   struct module *owner;
+struct devcd_file {
+   struct bin_attribute bin_attr;
ssize_t (*read)(char *buffer, loff_t offset, size_t count,
void *data, size_t datalen);
void (*free)(void *data);
+};
+
+struct devcd_entry {
+   struct device devcd_dev;
+   struct module *owner;
struct delayed_work del_wk;
struct device *failing_dev;
+   int num_files;
+   struct devcd_file files[];
 };
 
 static struct devcd_entry *dev_to_devcd(struct device *dev)
@@ -45,8 +49,14 @@ static struct devcd_entry *dev_to_devcd(struct device *dev)
 static void devcd_dev_release(struct device *dev)
 {
struct devcd_entry *devcd = dev_to_devcd(dev);
+   int i;
+
+   for (i = 0; i < devcd->num_files; i++) {
+   struct devcd_file *file = &devcd->files[i];
+
+   file->free(file->bin_attr.private);
+   }
 
-   devcd->free(devcd->data);
module_put(devcd->owner);
 
/*
@@ -64,9 +74,14 @@ static void devcd_dev_release(struct device *dev)
 static void devcd_del(struct work_struct *wk)
 {
struct devcd_entry *devcd;
+   int i;
 
devcd = container_of(wk, struct devcd_entry, del_wk.work);
 
+   for (i = 0; i < devcd->num_files; i++)
+   device_remove_bin_file(&devcd->devcd_dev,
+  &devcd->files[i].bin_attr);
+
device_del(&devcd->devcd_dev);
put_device(&devcd->devcd_dev);
 }
@@ -75,10 +90,11 @@ static ssize_t devcd_data_read(struct file *filp, struct 
kobject *kobj,
   struct bin_attribute *bin_attr,
   char *buffer, loff_t offset, size_t count)
 {
-   struct device *dev = kobj_to_dev(kobj);
-   struct devcd_entry *devcd = dev_to_devcd(dev);
+   struct devcd_file *file =
+   container_of(bin_attr, struct devcd_file, bin_attr);
 
-   return devcd->read(buffer, offset, count, devcd->data, devcd->datalen);
+   return file->read(buffer, offset, count, bin_attr->private,
+ bin_attr->size);
 }
 
 static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj,
@@ -93,25 +109,6 @@ static ssize_t devcd_data_write(struct file *filp, struct 
kobject *kobj,
return count;
 }
 
-static struct bin_attribute devcd_attr_data = {
-   .attr = { .name = "data", .mode = S_IRUSR | S_IWUSR, },
-   .size = 0,
-   .read = devcd_data_read,
-   .write = devcd_data_write,
-};
-
-static struct bin_attribute *devcd_dev_bin_attrs[] = {
-   &devcd_attr_data, NULL,
-};
-
-static const struct attribute_group devcd_dev_group = {
-   .bin_attrs = devcd_dev_bin_attrs,
-};
-
-static const struct attribute_group *devcd_dev_groups[] = {
-   &devcd_dev_group, NULL,
-};
-
 static int devcd_free(struct device *dev, void *data)
 {
struct devcd_entry *devcd = dev_to_devcd(dev);
@@ -157,7 +154,6 @@ static struct class devcd_class = {
.name   = "devcoredump",
.owner  = THIS_MODULE,
.dev_release= devcd_dev_release,
-   .dev_groups = devcd_dev_groups,
.class_groups   = devcd_class_groups,
 };
 
@@ -234,30 +230,55 @@ static ssize_t devcd_read_from_sgtable(char *buffer, 
loff_t offset,
  offset);
 }
 
+static struct devcd_entry *devcd_alloc(struct dev_coredumpm_bulk_data *files,
+  int num_files, gfp_t gfp)
+{
+   struct devcd_entry *devcd;
+   int i;
+
+   devcd = kzalloc(struct_size(devcd, files, num_files), gfp);
+   if (!devcd)
+   return NULL;
+
+   devcd->num_files = num_files;
+
+   for (i = 0; i < devcd->num_files; i++) {
+   struct devcd_file

[PATCH v3 2/7] devcoredump: fix typo in comment

2019-05-12 Thread Akinobu Mita

s/dev_coredumpmsg/dev_coredumpsg/

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Signed-off-by: Akinobu Mita 
---
* v3
- No change since v2

 drivers/base/devcoredump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 3c960a6..e42d0b5 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -314,7 +314,7 @@ void dev_coredumpm(struct device *dev, struct module *owner,
 EXPORT_SYMBOL_GPL(dev_coredumpm);
 
 /**
- * dev_coredumpmsg - create device coredump that uses scatterlist as data
+ * dev_coredumpsg - create device coredump that uses scatterlist as data
  * parameter
  * @dev: the struct device for the crashed device
  * @table: the dump data
-- 
2.7.4

[PATCH v3 6/7] nvme-pci: trigger device coredump on command timeout

2019-05-12 Thread Akinobu Mita

This enables the nvme driver to trigger a device coredump when a command
timeout occurs, which helps diagnose and debug issues.

This can be tested with fail_io_timeout fault injection.

# echo 1 > /sys/kernel/debug/fail_io_timeout/probability
# echo 1 > /sys/kernel/debug/fail_io_timeout/times
# echo 1 > /sys/block/nvme0n1/io-timeout-fail
# dd if=/dev/nvme0n1 of=/dev/null

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Signed-off-by: Akinobu Mita 
---
* v3
- Don't try to get telemetry log when admin queue is not available

 drivers/nvme/host/pci.c | 39 +++
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3eebb98..6522592 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -87,12 +87,12 @@ MODULE_PARM_DESC(poll_queues, "Number of queues to use for 
polled IO.");
 struct nvme_dev;
 struct nvme_queue;
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool dump);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
 
-static void __maybe_unused nvme_coredump_init(struct nvme_dev *dev);
-static void __maybe_unused nvme_coredump_logs(struct nvme_dev *dev);
-static void __maybe_unused nvme_coredump_complete(struct nvme_dev *dev);
+static void nvme_coredump_init(struct nvme_dev *dev);
+static void nvme_coredump_logs(struct nvme_dev *dev);
+static void nvme_coredump_complete(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -1280,7 +1280,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
 */
if (nvme_should_reset(dev, csts)) {
nvme_warn_reset(dev, csts);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
return BLK_EH_DONE;
}
@@ -1309,7 +1309,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
dev_warn_ratelimited(dev->ctrl.device,
 "I/O %d QID %d timeout, disable controller\n",
 req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, shutdown);
+   nvme_dev_disable(dev, shutdown, true);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_DONE;
default:
@@ -1325,7 +1325,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
dev_warn(dev->ctrl.device,
 "I/O %d QID %d timeout, reset controller\n",
 req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
 
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
@@ -2382,7 +2382,7 @@ static void nvme_pci_disable(struct nvme_dev *dev)
}
 }
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool dump)
 {
bool dead = true;
struct pci_dev *pdev = to_pci_dev(dev->dev);
@@ -2407,6 +2407,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
}
 
+   if (dump)
+   nvme_coredump_init(dev);
+
nvme_stop_queues(&dev->ctrl);
 
if (!dead && dev->ctrl.queue_count > 0) {
@@ -2477,7 +2480,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, 
int status)
dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", 
status);
 
nvme_get_ctrl(&dev->ctrl);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
nvme_kill_queues(&dev->ctrl);
if (!queue_work(nvme_wq, &dev->remove_work))
nvme_put_ctrl(&dev->ctrl);
@@ -2499,7 +2502,7 @@ static void nvme_reset_work(struct work_struct *work)
 * moving on.
 */
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
 
mutex_lock(&dev->shutdown_lock);
result = nvme_pci_enable(dev);
@@ -2536,6 +2539,9 @@ static void nvme_reset_work(struct work_struct *work)
if (result)
goto out;
 
+   nvme_coredump_logs(dev);
+   nvme_coredump_complete(dev);
+
if (dev->ctrl.oacs & NVME_CTRL_OACS_SEC_SUPP) {
if (!dev->ctrl.opal_dev)
dev->ctrl.opal_dev =
@@ -2598,6 +2604,7 @@ static

[PATCH v3 7/7] nvme-pci: enable to trigger device coredump by hand

2019-05-12 Thread Akinobu Mita

This provides a way to trigger the nvme device coredump by writing
anything to the /sys/devices/.../coredump attribute.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Signed-off-by: Akinobu Mita 
---
* v3
- Extracted from 'add device coredump infrastructure' patch
- Avoid deadlock in .coredump callback

 drivers/nvme/host/pci.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6522592..fad5395 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3288,6 +3288,14 @@ static void nvme_coredump_complete(struct nvme_dev *dev)
nvme_coredump_clear(dev);
 }
 
+static void nvme_coredump(struct device *dev)
+{
+   struct nvme_dev *ndev = dev_get_drvdata(dev);
+
+   nvme_dev_disable(ndev, false, true);
+   nvme_reset_ctrl_sync(&ndev->ctrl);
+}
+
 #else
 
 static void nvme_coredump_init(struct nvme_dev *dev)
@@ -3302,6 +3310,10 @@ static void nvme_coredump_complete(struct nvme_dev *dev)
 {
 }
 
+static void nvme_coredump(struct device *dev)
+{
+}
+
 #endif /* CONFIG_DEV_COREDUMP */
 
 static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
@@ -3409,6 +3421,7 @@ static struct pci_driver nvme_driver = {
.shutdown   = nvme_shutdown,
.driver = {
.pm = &nvme_dev_pm_ops,
+   .coredump = nvme_coredump,
},
.sriov_configure = pci_sriov_configure_simple,
.err_handler= &nvme_err_handler,
-- 
2.7.4

[PATCH v3 5/7] nvme-pci: add device coredump infrastructure

2019-05-12 Thread Akinobu Mita

This provides three functions to implement the device coredump for nvme
driver.

nvme_coredump_init() - This function is called when the driver decides
to start collecting a device coredump.  It captures snapshots of the
controller registers and of the admin and IO queues.

nvme_coredump_logs() - This function is called as soon as the device is
recovered from the crash and admin queue becomes available.  If the device
coredump has already been started by nvme_coredump_init(), the telemetry
controller-initiated data will be collected.  Otherwise do nothing.

nvme_coredump_complete() - This function is called when the driver
determines that there is nothing more to collect for the device coredump.
All collected coredumps are exported via the device coredump mechanism.

After finishing the nvme device coredump, the following files are created.

- regs: NVMe controller registers (00h to 4Fh)
- sq: Submission queue
- cq: Completion queue
- telemetry-ctrl-log: Telemetry controller-initiated log (if available)
- data: Empty

The empty 'data' file provides a uniform way to signal, by writing to it,
that the device coredump is no longer needed.

All existing drivers using the device coredump provide a 'data' file, so
if the nvme device coredump did not provide one, userspace programs would
need to know which driver provides which coredump files.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Signed-off-by: Akinobu Mita 
---
* v3
- Fix typo s/machanism/mechanism/ in commit log
- Fix max transfer size calculation for get log page
- Add function comments
- Extract 'enable to trigger device coredump by hand' patch

 drivers/nvme/host/Kconfig |   1 +
 drivers/nvme/host/core.c  |   1 +
 drivers/nvme/host/pci.c   | 448 ++
 3 files changed, 450 insertions(+)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 0f345e2..c3a06af 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -5,6 +5,7 @@ config BLK_DEV_NVME
tristate "NVM Express block device"
depends on PCI && BLOCK
select NVME_CORE
+   select WANT_DEV_COREDUMP
---help---
  The NVM Express driver is for solid state drives directly
  connected to the PCI or PCI Express bus.  If you know you
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0cea2a8..172551b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2462,6 +2462,7 @@ int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 
log_page, u8 lsp,
 
return nvme_submit_sync_cmd(ctrl->admin_q, &c, log, size);
 }
+EXPORT_SYMBOL_GPL(nvme_get_log);
 
 static int nvme_get_effects_log(struct nvme_ctrl *ctrl)
 {
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3e4fb89..3eebb98 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -89,6 +90,10 @@ struct nvme_queue;
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
 
+static void __maybe_unused nvme_coredump_init(struct nvme_dev *dev);
+static void __maybe_unused nvme_coredump_logs(struct nvme_dev *dev);
+static void __maybe_unused nvme_coredump_complete(struct nvme_dev *dev);
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
@@ -131,6 +136,9 @@ struct nvme_dev {
dma_addr_t host_mem_descs_dma;
struct nvme_host_mem_buf_desc *host_mem_descs;
void **host_mem_desc_bufs;
+
+   struct dev_coredumpm_bulk_data *dumps;
+   int num_dumps;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -2849,6 +2857,446 @@ static int nvme_resume(struct device *dev)
 
 static SIMPLE_DEV_PM_OPS(nvme_dev_pm_ops, nvme_suspend, nvme_resume);
 
+#ifdef CONFIG_DEV_COREDUMP
+
+static ssize_t nvme_coredump_read(char *buffer, loff_t offset, size_t count,
+ void *data, size_t datalen)
+{
+   return memory_read_from_buffer(buffer, count, &offset, data, datalen);
+}
+
+static void nvme_coredump_free(void *data)
+{
+   kvfree(data);
+}
+
+static int nvme_coredump_empty(struct dev_coredumpm_bulk_data *data)
+{
+   data->name = kstrdup("data", GFP_KERNEL);
+   if (!data->name)
+   return -ENOMEM;
+
+   data->data = NULL;
+   data->datalen = 0;
+   data->read = nvme_coredump_read;
+   data->free = nvme_coredump_free;
+
+   return 0;
+}
+
+static int nvme_coredump_regs(struct dev_coredumpm_bulk_data *data,
+ struct nvme_ctrl *ctrl)
+{
+   const int reg_size = 0x50; /* 00h to 4Fh */
+
+   data->name = kstrdup("regs", GFP_KERNEL);
+   if (!data->name

[PATCH v3 4/7] nvme: add basic facility to get telemetry log page

2019-05-12 Thread Akinobu Mita

This adds the required definisions to get telemetry log page.
The telemetry log page structure and identifier are copied from nvme-cli.

We also need a facility to check the log page attributes in order to know
whether the controller supports the telemetry log pages and the log page
offset field for the Get Log Page command.  The telemetry data area could be
larger than the maximum data transfer size, so we may need to split it into
multiple transfers with an incrementing log page offset.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Signed-off-by: Akinobu Mita 
---
* v3
- Merge 'add telemetry log page definitions' patch and 'add facility to
  check log page attributes' patch
- Copy struct nvme_telemetry_log_page_hdr from the latest nvme-cli
- Add BUILD_BUG_ON for the size of struct nvme_telemetry_log_page_hdr

 drivers/nvme/host/core.c |  2 ++
 drivers/nvme/host/nvme.h |  1 +
 include/linux/nvme.h | 17 +
 3 files changed, 20 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a6644a2..0cea2a8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2585,6 +2585,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
} else
ctrl->shutdown_timeout = shutdown_timeout;
 
+   ctrl->lpa = id->lpa;
ctrl->npss = id->npss;
ctrl->apsta = id->apsta;
prev_apst_enabled = ctrl->apst_enabled;
@@ -3898,6 +3899,7 @@ static inline void _nvme_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != NVME_IDENTIFY_DATA_SIZE);
BUILD_BUG_ON(sizeof(struct nvme_id_ns) != NVME_IDENTIFY_DATA_SIZE);
BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
+   BUILD_BUG_ON(sizeof(struct nvme_telemetry_log_page_hdr) != 512);
BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5ee75b5..7f6f1fc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -195,6 +195,7 @@ struct nvme_ctrl {
u32 vs;
u32 sgls;
u16 kas;
+   u8 lpa;
u8 npss;
u8 apsta;
u32 oaes;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index c40720c..8c0b29d 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -294,6 +294,8 @@ enum {
NVME_CTRL_OACS_DIRECTIVES   = 1 << 5,
NVME_CTRL_OACS_DBBUF_SUPP   = 1 << 8,
NVME_CTRL_LPA_CMD_EFFECTS_LOG   = 1 << 1,
+   NVME_CTRL_LPA_EXTENDED_DATA = 1 << 2,
+   NVME_CTRL_LPA_TELEMETRY_LOG = 1 << 3,
 };
 
 struct nvme_lbaf {
@@ -396,6 +398,20 @@ enum {
NVME_NIDT_UUID  = 0x03,
 };
 
+struct nvme_telemetry_log_page_hdr {
+   __u8lpi; /* Log page identifier */
+   __u8rsvd[4];
+   __u8iee_oui[3];
+   __le16  dalb1; /* Data area 1 last block */
+   __le16  dalb2; /* Data area 2 last block */
+   __le16  dalb3; /* Data area 3 last block */
+   __u8rsvd1[368];
+   __u8ctrlavail; /* Controller initiated data avail?*/
+   __u8ctrldgn; /* Controller initiated telemetry Data Gen # */
+   __u8rsnident[128];
+   __u8telemetry_dataarea[0];
+};
+
 struct nvme_smart_log {
__u8critical_warning;
__u8temperature[2];
@@ -832,6 +848,7 @@ enum {
NVME_LOG_FW_SLOT= 0x03,
NVME_LOG_CHANGED_NS = 0x04,
NVME_LOG_CMD_EFFECTS= 0x05,
+   NVME_LOG_TELEMETRY_CTRL = 0x08,
NVME_LOG_ANA= 0x0c,
NVME_LOG_DISC   = 0x70,
NVME_LOG_RESERVATION= 0x80,
-- 
2.7.4
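
The layout that the BUILD_BUG_ON in this patch guards can be checked in
userspace.  The following is a minimal sketch, not part of the patch: the
struct and helper names are hypothetical, the flexible array member is left
out, and the size formula assumes that the header occupies block 0 and that
the data areas are counted in 512-byte blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the header fields; the 16-bit fields are
 * little-endian on the wire. */
struct telemetry_hdr_model {
	uint8_t  lpi;		/* Log page identifier */
	uint8_t  rsvd[4];
	uint8_t  iee_oui[3];
	uint16_t dalb1;		/* Data area 1 last block */
	uint16_t dalb2;		/* Data area 2 last block */
	uint16_t dalb3;		/* Data area 3 last block */
	uint8_t  rsvd1[368];
	uint8_t  ctrlavail;
	uint8_t  ctrldgn;
	uint8_t  rsnident[128];
};

/* Compile-time equivalent of the BUILD_BUG_ON() size check. */
_Static_assert(sizeof(struct telemetry_hdr_model) == 512,
	       "telemetry header must be 512 bytes");
_Static_assert(offsetof(struct telemetry_hdr_model, dalb1) == 8,
	       "dalb1 must start at byte 8");
_Static_assert(offsetof(struct telemetry_hdr_model, rsnident) == 384,
	       "rsnident must start at byte 384");

/* Decode a little-endian 16-bit field regardless of host byte order. */
static uint16_t le16_decode(const uint8_t b[2])
{
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Assumption: total log size is the header block plus dalb3 data blocks. */
static size_t telemetry_total_bytes(uint16_t dalb3)
{
	return ((size_t)dalb3 + 1) * 512;
}

/* Hypothetical usage: a log whose last data block is block 3. */
static void telemetry_example(void)
{
	const uint8_t raw[2] = { 0x03, 0x00 };

	assert(telemetry_total_bytes(le16_decode(raw)) == 2048);
}
```

A reader would fetch the 512-byte header first, decode dalb3, and then size
the remaining reads from that value.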

[PATCH v3 0/7] nvme-pci: support device coredump

2019-05-12 Thread Akinobu Mita

This enables collecting a snapshot of controller information via the device
coredump mechanism.  The nvme device coredump is triggered when a command
timeout occurs, and can also be triggered by writing to a sysfs attribute.

After finishing the nvme device coredump, the following files are created.

 - regs: NVMe controller registers (00h to 4Fh)
 - sq: Submission queue
 - cq: Completion queue
 - telemetry-ctrl-log: Telemetry controller-initiated log (if available)
 - data: Empty

The device coredump mechanism currently allows drivers to create only a
single coredump file, so this also provides a new function that allows
drivers to create several device coredump files in one crashed device.

* v3
- Merge 'add telemetry log page definitions' patch and 'add facility to
  check log page attributes' patch
- Copy struct nvme_telemetry_log_page_hdr from the latest nvme-cli
- Add BUILD_BUG_ON for the size of struct nvme_telemetry_log_page_hdr
- Fix typo s/machanism/mechanism/ in commit log
- Fix max transfer size calculation for get log page
- Add function comments
- Extract 'enable to trigger device coredump by hand' patch
- Don't try to get telemetry log when admin queue is not available
- Avoid deadlock in .coredump callback

* v2
- Add Reviewed-by tag.
- Add patch to fix typo in comment
- Remove unneeded braces.
- Allocate device_entry followed by an array of devcd_file elements.
- Add telemetry log page definitions
- Add facility to check log page attributes
- Exclude the doorbell registers from register dump.
- Save controller registers in a binary format instead of a text format.
- Create an empty 'data' file in the device coredump.
- Save telemetry controller-initiated log if available
- Make coredump procedure into two phases (before resetting controller and
  after resetting as soon as admin queue is available).

Akinobu Mita (7):
  devcoredump: use memory_read_from_buffer
  devcoredump: fix typo in comment
  devcoredump: allow to create several coredump files in one device
  nvme: add basic facility to get telemetry log page
  nvme-pci: add device coredump infrastructure
  nvme-pci: trigger device coredump on command timeout
  nvme-pci: enable to trigger device coredump by hand

 drivers/base/devcoredump.c  | 168 +--
 drivers/nvme/host/Kconfig   |   1 +
 drivers/nvme/host/core.c|   3 +
 drivers/nvme/host/nvme.h|   1 +
 drivers/nvme/host/pci.c | 494 ++--
 include/linux/devcoredump.h |  33 +++
 include/linux/nvme.h|  17 ++
 7 files changed, 644 insertions(+), 73 deletions(-)

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
-- 
2.7.4

[PATCH v3 1/7] devcoredump: use memory_read_from_buffer

2019-05-12 Thread Akinobu Mita

Use memory_read_from_buffer() to simplify devcd_readv().

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Cc: Kenneth Heitke 
Reviewed-by: Johannes Berg 
Signed-off-by: Akinobu Mita 
---
* v3
- No change since v2

 drivers/base/devcoredump.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index f1a3353..3c960a6 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -164,16 +164,7 @@ static struct class devcd_class = {
 static ssize_t devcd_readv(char *buffer, loff_t offset, size_t count,
   void *data, size_t datalen)
 {
-   if (offset > datalen)
-   return -EINVAL;
-
-   if (offset + count > datalen)
-   count = datalen - offset;
-
-   if (count)
-   memcpy(buffer, ((u8 *)data) + offset, count);
-
-   return count;
+   return memory_read_from_buffer(buffer, count, &offset, data, datalen);
 }
 
 static void devcd_freev(void *data)
-- 
2.7.4
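
For readers unfamiliar with the helper, here is a userspace sketch of the
semantics that devcd_readv() now delegates to.  It is an approximation for
illustration only: the function name is hypothetical, "long long" stands in
for loff_t, and -1 stands in for -EINVAL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model of memory_read_from_buffer(). */
static long model_read_from_buffer(void *to, size_t count, long long *ppos,
				   const void *from, size_t available)
{
	long long pos = *ppos;

	if (pos < 0)
		return -1;	/* the kernel helper returns -EINVAL here */
	if ((unsigned long long)pos >= available)
		return 0;	/* at or past the end: nothing to read */
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;	/* clamp to the end */
	memcpy(to, (const char *)from + pos, count);
	*ppos = pos + (long long)count;

	return (long)count;
}

/* Hypothetical usage: ask for 4 bytes at offset 6 of an 8-byte blob. */
static void model_read_example(void)
{
	static const char blob[] = "ABCDEFGH";
	char out[4] = { 0 };
	long long pos = 6;
	long n = model_read_from_buffer(out, sizeof(out), &pos, blob, 8);

	assert(n == 2 && out[0] == 'G' && out[1] == 'H' && pos == 8);
}
```

One difference from the removed open-coded version is visible in the model:
an offset beyond the end of the data yields 0 rather than -EINVAL.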

Re: [PATCH v2 6/7] nvme-pci: add device coredump support

2019-05-08 Thread Akinobu Mita

2019年5月8日(水) 9:25 Minwoo Im :
>
> > This is a bit of a mine field. The shutdown_lock is held when reclaiming
> > requests that didn't see a response. If you're holding it here and your
> > telemetry log page times out, we're going to deadlock. And since the
> > controller is probably in a buggered state when you try to retrieve one,
> > I would guess an unrecoverable timeout is the most likely outcome.
>
> Akinobu,
>
> I actually agree with Keith's one.  In my experience, there was always 
> internal
> error inside device when timeout occurs in nvme driver which means the
> following command might not be completed due to lack of response from
> device.

nvme_coredump() is the .coredump() callback of device_driver, which is
called when anything is written to /sys/devices/.../coredump.
Providing this callback is optional, but simply removing this manual
device coredump method would be a bit inconvenient.

So instead of directly retrieving the snapshot with the shutdown_lock held
in this callback, I'll change it to just schedule the reset work, and
the actual device coredump will be triggered by the same procedure that is
implemented in patch 7/7.  The telemetry log is therefore retrieved only
when the controller has successfully recovered from the crash.

Re: [PATCH v2 6/7] nvme-pci: add device coredump support

2019-05-08 Thread Akinobu Mita

2019年5月8日(水) 6:28 Keith Busch :
>
> On Tue, May 07, 2019 at 02:31:41PM -0600, Heitke, Kenneth wrote:
> > On 5/7/2019 10:58 AM, Akinobu Mita wrote:
> > > +
> > > +static int nvme_get_telemetry_log_blocks(struct nvme_ctrl *ctrl, void 
> > > *buf,
> > > +size_t bytes, loff_t offset)
> > > +{
> > > +   const size_t chunk_size = ctrl->max_hw_sectors * ctrl->page_size;
> >
> > Just curious if chunk_size is correct since page size and block size can
> > be different.
>
> They're always different. ctrl->page_size is hard-coded to 4k, while
> sectors are always 512b.

Oops.  I misunderstood how ctrl->max_hw_sectors is initialized from MDTS.
An overflow check is also required here for architectures where size_t is
"unsigned int".

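The two points raised in this exchange (sector units and overflow) can be
sketched in userspace as follows.  This is only an illustration under the
assumption that max_hw_sectors counts 512-byte sectors, as stated in the
quoted reply; the function names are hypothetical and this is not the code
that went into v3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT_MODEL 9	/* 512-byte sectors */

/* Convert a sector count to bytes without overflowing size_t. */
static size_t max_transfer_bytes(uint32_t max_hw_sectors)
{
	/* Only reachable where size_t is narrower than 41 bits. */
	if (max_hw_sectors > (SIZE_MAX >> SECTOR_SHIFT_MODEL))
		return SIZE_MAX & ~(((size_t)1 << SECTOR_SHIFT_MODEL) - 1);

	return (size_t)max_hw_sectors << SECTOR_SHIFT_MODEL;
}

/* Count the transfers needed to read "bytes" in pieces of "chunk". */
static size_t count_transfers(size_t bytes, size_t chunk)
{
	size_t offset = 0, n = 0;

	if (chunk == 0)
		return 0;

	while (offset < bytes) {
		size_t len = bytes - offset < chunk ? bytes - offset : chunk;

		offset += len;	/* next transfer starts at this offset */
		n++;
	}

	return n;
}

/* Hypothetical usage: 8 sectors per transfer, 10000-byte log. */
static void chunk_example(void)
{
	size_t chunk = max_transfer_bytes(8);

	assert(chunk == 4096);
	assert(count_transfers(10000, chunk) == 3);
}
```
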
Re: [PATCH v2 4/7] nvme.h: add telemetry log page definisions

2019-05-08 Thread Akinobu Mita

2019年5月8日(水) 2:53 Heitke, Kenneth :
>
>
>
> On 5/7/2019 10:58 AM, Akinobu Mita wrote:
> > Copy telemetry log page definisions from nvme-cli.
> >
> > Cc: Johannes Berg 
> > Cc: Keith Busch 
> > Cc: Jens Axboe 
> > Cc: Christoph Hellwig 
> > Cc: Sagi Grimberg 
> > Cc: Minwoo Im 
> > Signed-off-by: Akinobu Mita 
> > ---
> > * v2
> > - New patch in this version.
> >
> >   include/linux/nvme.h | 23 +++
> >   1 file changed, 23 insertions(+)
> >
> > diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> > index c40720c..5217fe4 100644
> > --- a/include/linux/nvme.h
> > +++ b/include/linux/nvme.h
> > @@ -396,6 +396,28 @@ enum {
> >   NVME_NIDT_UUID  = 0x03,
> >   };
> >
> > +/* Derived from 1.3a Figure 101: Get Log Page – Telemetry Host
> > + * -Initiated Log (Log Identifier 07h)
> > + */
>
> Is this Host Initiated or Controller Initiated? The comment says host
> initiated but everything else seems to indicated controller initiated.

Both the telemetry host-initiated and controller-initiated log headers have
the same structure.  If this comment is confusing, I can remove it.

> Is controller initiated even the correct choice because the controller
> would have sent an AER to indicate that the host should pull the
> telemetry data.

It seems useful to retrieve the telemetry log continually with the aid of
a user space tool reacting to an Asynchronous Event.

Similarly, it could be useful to retrieve the telemetry log as soon as the
device has successfully recovered from the crash.  (Although I have not yet
found a device that has the Telemetry Controller-Initiated Data Available
field set to 1h.)

Re: [PATCH v2 4/7] nvme.h: add telemetry log page definisions

2019-05-08 Thread Akinobu Mita

2019年5月8日(水) 2:28 Heitke, Kenneth :
>
>
>
> On 5/7/2019 10:58 AM, Akinobu Mita wrote:
> > Copy telemetry log page definisions from nvme-cli.
> >
> > Cc: Johannes Berg 
> > Cc: Keith Busch 
> > Cc: Jens Axboe 
> > Cc: Christoph Hellwig 
> > Cc: Sagi Grimberg 
> > Cc: Minwoo Im 
> > Signed-off-by: Akinobu Mita 
> > ---
> > * v2
> > - New patch in this version.
> >
> >   include/linux/nvme.h | 23 +++
> >   1 file changed, 23 insertions(+)
> >
> > diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> > index c40720c..5217fe4 100644
> > --- a/include/linux/nvme.h
> > +++ b/include/linux/nvme.h
> > @@ -396,6 +396,28 @@ enum {
> >   NVME_NIDT_UUID  = 0x03,
> >   };
> >
> > +/* Derived from 1.3a Figure 101: Get Log Page – Telemetry Host
> > + * -Initiated Log (Log Identifier 07h)
> > + */
> > +struct nvme_telemetry_log_page_hdr {
> > + __u8lpi; /* Log page identifier */
> > + __u8rsvd[4];
> > + __u8iee_oui[3];
> > + __le16  dalb1; /* Data area 1 last block */
> > + __le16  dalb2; /* Data area 2 last block */
> > + __le16  dalb3; /* Data area 3 last block */
> > + __u8rsvd1[368]; /* TODO verify */
>
> Remove the TODO

OK.

> > + __u8ctrlavail; /* Controller initiated data avail?*/
> > + __u8ctrldgn; /* Controller initiated telemetry Data Gen # */
> > + __u8rsnident[128];
> > + /* We'll have to double fetch so we can get the header,
> > +  * parse dalb1->3 determine how much size we need for the
> > +  * log then alloc below. Or just do a secondary non-struct
> > +  * allocation.
> > +  */
>
> This comment isn't necessary. You usually can't read the entire
> telemetry log at once and the header is a fixed size. You would likely
> just read the header followed by reads of the different data areas.

This comment is derived from nvme-cli.  So firstly, I'll send a patch
for nvme-cli.  If the changes are accepted, I'll update this comment, too.

> > + __u8telemetry_dataarea[0];
> > +};
> > +
> >   struct nvme_smart_log {
> >   __u8critical_warning;
> >   __u8temperature[2];
> > @@ -832,6 +854,7 @@ enum {
> >   NVME_LOG_FW_SLOT= 0x03,
> >   NVME_LOG_CHANGED_NS = 0x04,
> >   NVME_LOG_CMD_EFFECTS= 0x05,
> > + NVME_LOG_TELEMETRY_CTRL = 0x08,
> >   NVME_LOG_ANA= 0x0c,
> >   NVME_LOG_DISC   = 0x70,
> >   NVME_LOG_RESERVATION= 0x80,
> >

Re: [PATCH v2 3/7] devcoredump: allow to create several coredump files in one device

2019-05-08 Thread Akinobu Mita

2019年5月8日(水) 2:35 Heitke, Kenneth :
>
>
>
> On 5/7/2019 10:58 AM, Akinobu Mita wrote:
> > @@ -292,6 +309,12 @@ void dev_coredumpm(struct device *dev, struct module 
> > *owner,
> >   if (device_add(&devcd->devcd_dev))
> >   goto put_device;
> >
> > + for (i = 0; i < devcd->num_files; i++) {
> > + if (device_create_bin_file(&devcd->devcd_dev,
> > +&devcd->files[i].bin_attr))
> > + /* nothing - some files will be missing */;
>
> Is the conditional necessary if you aren't going to do anything?

device_create_bin_file() is declared with __must_check, so ignoring
the return value emits a warning.
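
The idiom can be reproduced in userspace.  The sketch below assumes a GCC or
Clang compiler, where the warn_unused_result attribute plays the role of
__must_check; all names in it are hypothetical.

```c
#include <assert.h>

/* Stand-in for the kernel's __must_check annotation (GCC/Clang only). */
#define must_check_model __attribute__((warn_unused_result))

static int created;

/* Pretend to create one file; fails when asked to. */
static must_check_model int create_file_model(int should_fail)
{
	if (should_fail)
		return -1;

	created++;
	return 0;
}

/* Create every file, deliberately tolerating individual failures. */
static int create_all_model(const int *fail, int n)
{
	int i;

	created = 0;
	for (i = 0; i < n; i++) {
		/* Testing the result consumes it, so no warning is emitted. */
		if (create_file_model(fail[i]))
			/* nothing - some files will be missing */;
	}

	return created;
}

/* Hypothetical usage: the second of three files fails. */
static void must_check_example(void)
{
	const int fail[3] = { 0, 1, 0 };

	assert(create_all_model(fail, 3) == 2);
}
```

Calling create_file_model() as a bare statement instead would trigger the
unused-result warning on those compilers.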

[PATCH v2 2/7] devcoredump: fix typo in comment

2019-05-07 Thread Akinobu Mita

s/dev_coredumpmsg/dev_coredumpsg/

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Signed-off-by: Akinobu Mita 
---
* v2
- New patch in this version.

 drivers/base/devcoredump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 3c960a6..e42d0b5 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -314,7 +314,7 @@ void dev_coredumpm(struct device *dev, struct module *owner,
 EXPORT_SYMBOL_GPL(dev_coredumpm);
 
 /**
- * dev_coredumpmsg - create device coredump that uses scatterlist as data
+ * dev_coredumpsg - create device coredump that uses scatterlist as data
  * parameter
  * @dev: the struct device for the crashed device
  * @table: the dump data
-- 
2.7.4

[PATCH v2 0/7] nvme-pci: support device coredump

2019-05-07 Thread Akinobu Mita

This enables capturing a snapshot of controller information via the device
coredump mechanism, which helps diagnose and debug issues.

The nvme device coredump is triggered when a command timeout occurs, and
creates the following coredump files.

 - regs: NVMe controller registers (00h to 4Fh)
 - sq: Submission queue
 - cq: Completion queue
 - telemetry-ctrl-log: Telemetry controller-initiated log (if available)
 - data: Empty

(I don't currently have an NVMe device that supports the telemetry log page,
so capturing the telemetry log is untested.)

The device coredump mechanism currently allows drivers to create only a
single coredump file, so this also provides a new function that allows
drivers to create several device coredump files in one crashed device.

* v2
- Add Reviewed-by tag.
- Add patch to fix typo in comment
- Remove unneeded braces.
- Allocate device_entry followed by an array of devcd_file elements.
- Add telemetry log page definitions
- Add facility to check log page attributes
- Exclude the doorbell registers from register dump.
- Save controller registers in a binary format instead of a text format.
- Create an empty 'data' file in the device coredump.
- Save telemetry controller-initiated log if available
- Make coredump procedure into two phases (before resetting controller and
  after resetting as soon as admin queue is available).

Akinobu Mita (7):
  devcoredump: use memory_read_from_buffer
  devcoredump: fix typo in comment
  devcoredump: allow to create several coredump files in one device
  nvme.h: add telemetry log page definisions
  nvme: add facility to check log page attributes
  nvme-pci: add device coredump support
  nvme-pci: trigger device coredump on command timeout

 drivers/base/devcoredump.c  | 168 ++--
 drivers/nvme/host/Kconfig   |   1 +
 drivers/nvme/host/core.c|   2 +
 drivers/nvme/host/nvme.h|   1 +
 drivers/nvme/host/pci.c | 460 ++--
 include/linux/devcoredump.h |  33 
 include/linux/nvme.h|  25 +++
 7 files changed, 617 insertions(+), 73 deletions(-)

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
-- 
2.7.4

[PATCH v2 1/7] devcoredump: use memory_read_from_buffer

2019-05-07 Thread Akinobu Mita

Use memory_read_from_buffer() to simplify devcd_readv().

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Reviewed-by: Johannes Berg 
Signed-off-by: Akinobu Mita 
---
* v2
- Add Reviewed-by tag.

 drivers/base/devcoredump.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index f1a3353..3c960a6 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -164,16 +164,7 @@ static struct class devcd_class = {
 static ssize_t devcd_readv(char *buffer, loff_t offset, size_t count,
   void *data, size_t datalen)
 {
-   if (offset > datalen)
-   return -EINVAL;
-
-   if (offset + count > datalen)
-   count = datalen - offset;
-
-   if (count)
-   memcpy(buffer, ((u8 *)data) + offset, count);
-
-   return count;
+   return memory_read_from_buffer(buffer, count, &offset, data, datalen);
 }
 
 static void devcd_freev(void *data)
-- 
2.7.4

[PATCH v2 3/7] devcoredump: allow to create several coredump files in one device

2019-05-07 Thread Akinobu Mita

The device coredump mechanism currently allows drivers to create only a
single coredump file.  If there are several binary blobs to dump, we need
to define a binary format or convert them to a text format in order to put
them into a single coredump file.

This provides a new function that allows drivers to create several device
coredump files in one crashed device.
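
The v2 change "Allocate device_entry followed by an array of devcd_file
elements" relies on a flexible array member.  A minimal userspace sketch of
that allocation pattern follows; the type and function names are
hypothetical, and the explicit overflow check only approximates what the
kernel's struct_size() helper provides.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct file_model {
	const char *name;
	size_t size;
};

struct entry_model {
	int num_files;
	struct file_model files[];	/* flexible array member */
};

/* Allocate the entry and its trailing file array in one zeroed block. */
static struct entry_model *entry_alloc_model(size_t num_files)
{
	struct entry_model *e;

	/* Reject counts that would overflow the count or the size. */
	if (num_files > INT_MAX)
		return NULL;
	if (num_files > (SIZE_MAX - sizeof(*e)) / sizeof(e->files[0]))
		return NULL;

	e = calloc(1, sizeof(*e) + num_files * sizeof(e->files[0]));
	if (!e)
		return NULL;

	e->num_files = (int)num_files;
	return e;
}

/* Hypothetical usage: an entry carrying three files. */
static void entry_example(void)
{
	struct entry_model *e = entry_alloc_model(3);

	assert(e && e->num_files == 3 && e->files[2].name == NULL);
	free(e);
}
```

A single allocation means a single free on release, which keeps the error
paths short.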

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Signed-off-by: Akinobu Mita 
---
* v2
- Remove unneeded braces.
- Allocate device_entry followed by an array of devcd_file elements.

 drivers/base/devcoredump.c  | 155 ++--
 include/linux/devcoredump.h |  33 ++
 2 files changed, 139 insertions(+), 49 deletions(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index e42d0b5..4dd6dba 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -25,16 +25,20 @@ static bool devcd_disabled;
 /* if data isn't read by userspace after 5 minutes then delete it */
 #define DEVCD_TIMEOUT  (HZ * 60 * 5)
 
-struct devcd_entry {
-   struct device devcd_dev;
-   void *data;
-   size_t datalen;
-   struct module *owner;
+struct devcd_file {
+   struct bin_attribute bin_attr;
ssize_t (*read)(char *buffer, loff_t offset, size_t count,
void *data, size_t datalen);
void (*free)(void *data);
+};
+
+struct devcd_entry {
+   struct device devcd_dev;
+   struct module *owner;
struct delayed_work del_wk;
struct device *failing_dev;
+   int num_files;
+   struct devcd_file files[];
 };
 
 static struct devcd_entry *dev_to_devcd(struct device *dev)
@@ -45,8 +49,14 @@ static struct devcd_entry *dev_to_devcd(struct device *dev)
 static void devcd_dev_release(struct device *dev)
 {
struct devcd_entry *devcd = dev_to_devcd(dev);
+   int i;
+
+   for (i = 0; i < devcd->num_files; i++) {
+   struct devcd_file *file = &devcd->files[i];
+
+   file->free(file->bin_attr.private);
+   }
 
-   devcd->free(devcd->data);
module_put(devcd->owner);
 
/*
@@ -64,9 +74,14 @@ static void devcd_dev_release(struct device *dev)
 static void devcd_del(struct work_struct *wk)
 {
struct devcd_entry *devcd;
+   int i;
 
devcd = container_of(wk, struct devcd_entry, del_wk.work);
 
+   for (i = 0; i < devcd->num_files; i++)
+   device_remove_bin_file(&devcd->devcd_dev,
+  &devcd->files[i].bin_attr);
+
device_del(&devcd->devcd_dev);
put_device(&devcd->devcd_dev);
 }
@@ -75,10 +90,11 @@ static ssize_t devcd_data_read(struct file *filp, struct 
kobject *kobj,
   struct bin_attribute *bin_attr,
   char *buffer, loff_t offset, size_t count)
 {
-   struct device *dev = kobj_to_dev(kobj);
-   struct devcd_entry *devcd = dev_to_devcd(dev);
+   struct devcd_file *file =
+   container_of(bin_attr, struct devcd_file, bin_attr);
 
-   return devcd->read(buffer, offset, count, devcd->data, devcd->datalen);
+   return file->read(buffer, offset, count, bin_attr->private,
+ bin_attr->size);
 }
 
 static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj,
@@ -93,25 +109,6 @@ static ssize_t devcd_data_write(struct file *filp, struct 
kobject *kobj,
return count;
 }
 
-static struct bin_attribute devcd_attr_data = {
-   .attr = { .name = "data", .mode = S_IRUSR | S_IWUSR, },
-   .size = 0,
-   .read = devcd_data_read,
-   .write = devcd_data_write,
-};
-
-static struct bin_attribute *devcd_dev_bin_attrs[] = {
-   &devcd_attr_data, NULL,
-};
-
-static const struct attribute_group devcd_dev_group = {
-   .bin_attrs = devcd_dev_bin_attrs,
-};
-
-static const struct attribute_group *devcd_dev_groups[] = {
-   &devcd_dev_group, NULL,
-};
-
 static int devcd_free(struct device *dev, void *data)
 {
struct devcd_entry *devcd = dev_to_devcd(dev);
@@ -157,7 +154,6 @@ static struct class devcd_class = {
.name   = "devcoredump",
.owner  = THIS_MODULE,
.dev_release= devcd_dev_release,
-   .dev_groups = devcd_dev_groups,
.class_groups   = devcd_class_groups,
 };
 
@@ -234,30 +230,55 @@ static ssize_t devcd_read_from_sgtable(char *buffer, 
loff_t offset,
  offset);
 }
 
+static struct devcd_entry *devcd_alloc(struct dev_coredumpm_bulk_data *files,
+  int num_files, gfp_t gfp)
+{
+   struct devcd_entry *devcd;
+   int i;
+
+   devcd = kzalloc(struct_size(devcd, files, num_files), gfp);
+   if (!devcd)
+   return NULL;
+
+   devcd->num_files = num_files;
+
+   for (i =

[PATCH v2 5/7] nvme: add facility to check log page attributes

2019-05-07 Thread Akinobu Mita

This provides a facility to check whether the controller supports the
telemetry log pages and log page offset field for the Get Log Page
command.
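
As an illustration of how a caller might use the new field, here is a small
userspace sketch.  The bit positions come from the enum added by this patch;
the constant and function names are hypothetical, and the policy of
requiring the offset capability for multi-transfer reads is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit positions as NVME_CTRL_LPA_EXTENDED_DATA and
 * NVME_CTRL_LPA_TELEMETRY_LOG in the patch. */
enum {
	LPA_EXTENDED_DATA_MODEL = 1 << 2,
	LPA_TELEMETRY_LOG_MODEL = 1 << 3,
};

/*
 * Decide whether the telemetry log can be fetched.  "needs_offset" is set
 * when the log does not fit in one transfer, so the log page offset field
 * must be supported as well.
 */
static bool can_fetch_telemetry(uint8_t lpa, bool needs_offset)
{
	if (!(lpa & LPA_TELEMETRY_LOG_MODEL))
		return false;
	if (needs_offset && !(lpa & LPA_EXTENDED_DATA_MODEL))
		return false;

	return true;
}

/* Hypothetical usage: telemetry supported, offset field not supported. */
static void lpa_example(void)
{
	assert(can_fetch_telemetry(0x08, false));
	assert(!can_fetch_telemetry(0x08, true));
}
```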

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Signed-off-by: Akinobu Mita 
---
* v2
- New patch in this version.

 drivers/nvme/host/core.c | 1 +
 drivers/nvme/host/nvme.h | 1 +
 include/linux/nvme.h | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6265d92..42f09d6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2580,6 +2580,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
} else
ctrl->shutdown_timeout = shutdown_timeout;
 
+   ctrl->lpa = id->lpa;
ctrl->npss = id->npss;
ctrl->apsta = id->apsta;
prev_apst_enabled = ctrl->apst_enabled;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 527d645..8711c71 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -195,6 +195,7 @@ struct nvme_ctrl {
u32 vs;
u32 sgls;
u16 kas;
+   u8 lpa;
u8 npss;
u8 apsta;
u32 oaes;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 5217fe4..c1c4ca5 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -294,6 +294,8 @@ enum {
NVME_CTRL_OACS_DIRECTIVES   = 1 << 5,
NVME_CTRL_OACS_DBBUF_SUPP   = 1 << 8,
NVME_CTRL_LPA_CMD_EFFECTS_LOG   = 1 << 1,
+   NVME_CTRL_LPA_EXTENDED_DATA = 1 << 2,
+   NVME_CTRL_LPA_TELEMETRY_LOG = 1 << 3,
 };
 
 struct nvme_lbaf {
-- 
2.7.4

[PATCH v2 4/7] nvme.h: add telemetry log page definisions

2019-05-07 Thread Akinobu Mita

Copy telemetry log page definitions from nvme-cli.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Signed-off-by: Akinobu Mita 
---
* v2
- New patch in this version.

 include/linux/nvme.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index c40720c..5217fe4 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -396,6 +396,28 @@ enum {
NVME_NIDT_UUID  = 0x03,
 };
 
+/* Derived from 1.3a Figure 101: Get Log Page – Telemetry Host
+ * -Initiated Log (Log Identifier 07h)
+ */
+struct nvme_telemetry_log_page_hdr {
+   __u8lpi; /* Log page identifier */
+   __u8rsvd[4];
+   __u8iee_oui[3];
+   __le16  dalb1; /* Data area 1 last block */
+   __le16  dalb2; /* Data area 2 last block */
+   __le16  dalb3; /* Data area 3 last block */
+   __u8rsvd1[368]; /* TODO verify */
+   __u8ctrlavail; /* Controller initiated data avail?*/
+   __u8ctrldgn; /* Controller initiated telemetry Data Gen # */
+   __u8rsnident[128];
+   /* We'll have to double fetch so we can get the header,
+* parse dalb1->3 determine how much size we need for the
+* log then alloc below. Or just do a secondary non-struct
+* allocation.
+*/
+   __u8telemetry_dataarea[0];
+};
+
 struct nvme_smart_log {
__u8critical_warning;
__u8temperature[2];
@@ -832,6 +854,7 @@ enum {
NVME_LOG_FW_SLOT= 0x03,
NVME_LOG_CHANGED_NS = 0x04,
NVME_LOG_CMD_EFFECTS= 0x05,
+   NVME_LOG_TELEMETRY_CTRL = 0x08,
NVME_LOG_ANA= 0x0c,
NVME_LOG_DISC   = 0x70,
NVME_LOG_RESERVATION= 0x80,
-- 
2.7.4

[PATCH v2 7/7] nvme-pci: trigger device coredump on command timeout

2019-05-07 Thread Akinobu Mita

This enables the nvme driver to trigger a device coredump when a command
timeout occurs, which helps diagnose and debug issues.

This can be tested with fail_io_timeout fault injection.

# echo 1 > /sys/kernel/debug/fail_io_timeout/probability
# echo 1 > /sys/kernel/debug/fail_io_timeout/times
# echo 1 > /sys/block/nvme0n1/io-timeout-fail
# dd if=/dev/nvme0n1 of=/dev/null

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Signed-off-by: Akinobu Mita 
---
- Make coredump procedure into two phases (before resetting controller and
  after resetting as soon as admin queue is available).

 drivers/nvme/host/pci.c | 35 ++-
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4684a86..4ff918f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -87,9 +87,12 @@ MODULE_PARM_DESC(poll_queues, "Number of queues to use for 
polled IO.");
 struct nvme_dev;
 struct nvme_queue;
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool dump);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
 
+static int nvme_coredump_prologue(struct nvme_dev *dev);
+static void nvme_coredump_epilogue(struct nvme_dev *dev);
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
@@ -1289,7 +1292,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
 */
if (nvme_should_reset(dev, csts)) {
nvme_warn_reset(dev, csts);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
return BLK_EH_DONE;
}
@@ -1316,7 +1319,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
dev_warn_ratelimited(dev->ctrl.device,
 "I/O %d QID %d timeout, disable controller\n",
 req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_DONE;
default:
@@ -1332,7 +1335,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
dev_warn(dev->ctrl.device,
 "I/O %d QID %d timeout, reset controller\n",
 req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
 
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
@@ -2399,7 +2402,7 @@ static void nvme_pci_disable(struct nvme_dev *dev)
}
 }
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool dump)
 {
bool dead = true;
struct pci_dev *pdev = to_pci_dev(dev->dev);
@@ -2424,6 +2427,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
}
 
+   if (dump)
+   nvme_coredump_prologue(dev);
+
nvme_stop_queues(&dev->ctrl);
 
if (!dead && dev->ctrl.queue_count > 0) {
@@ -2491,7 +2497,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, 
int status)
dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", 
status);
 
nvme_get_ctrl(&dev->ctrl);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
nvme_kill_queues(&dev->ctrl);
if (!queue_work(nvme_wq, &dev->remove_work))
nvme_put_ctrl(&dev->ctrl);
@@ -2513,7 +2519,7 @@ static void nvme_reset_work(struct work_struct *work)
 * moving on.
 */
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
 
mutex_lock(&dev->shutdown_lock);
result = nvme_pci_enable(dev);
@@ -2550,6 +2556,8 @@ static void nvme_reset_work(struct work_struct *work)
if (result)
goto out;
 
+   nvme_coredump_epilogue(dev);
+
if (dev->ctrl.oacs & NVME_CTRL_OACS_SEC_SUPP) {
if (!dev->ctrl.opal_dev)
dev->ctrl.opal_dev =
@@ -2612,6 +2620,7 @@ static void nvme_reset_work(struct work_struct *work)
  out_unlock:
mutex_unlock(&dev->shutdown_lock);
  out:
+   nvme_coredump_epilogue(dev);
nvme_remove_dead_ctrl(dev, result);
 }
 
@@ -2802,7 +2811,7 @@ static int nvme_probe(struct pci_dev *pdev

[PATCH v2 6/7] nvme-pci: add device coredump support

2019-05-07 Thread Akinobu Mita

This enables capturing a snapshot of controller information via the device
coredump mechanism.

The nvme device coredump creates the following coredump files.

- regs: NVMe controller registers (00h to 4Fh)
- sq: Submission queue
- cq: Completion queue
- telemetry-ctrl-log: Telemetry controller-initiated log (if available)
- data: Empty

The reason for an empty 'data' file is to provide a uniform way to signal
that the device coredump is no longer needed: writing to the 'data' file.

All existing drivers using the device coredump provide a 'data' file.  If
the nvme device coredump did not provide it, userspace programs would need
to know which driver provides which coredump file.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Minwoo Im 
Signed-off-by: Akinobu Mita 
---
* v2
- Exclude the doorbell registers from register dump.
- Save controller registers in a binary format instead of a text format.
- Create an empty 'data' file in the device coredump.
- Save telemetry controller-initiated log if available

 drivers/nvme/host/Kconfig |   1 +
 drivers/nvme/host/core.c  |   1 +
 drivers/nvme/host/pci.c   | 425 ++
 3 files changed, 427 insertions(+)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 0f345e2..c3a06af 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -5,6 +5,7 @@ config BLK_DEV_NVME
tristate "NVM Express block device"
depends on PCI && BLOCK
select NVME_CORE
+   select WANT_DEV_COREDUMP
---help---
  The NVM Express driver is for solid state drives directly
  connected to the PCI or PCI Express bus.  If you know you
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 42f09d6..8d297c7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2457,6 +2457,7 @@ int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 
log_page, u8 lsp,
 
return nvme_submit_sync_cmd(ctrl->admin_q, &c, log, size);
 }
+EXPORT_SYMBOL_GPL(nvme_get_log);
 
 static int nvme_get_effects_log(struct nvme_ctrl *ctrl)
 {
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a90cf5d..4684a86 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -131,6 +132,9 @@ struct nvme_dev {
dma_addr_t host_mem_descs_dma;
struct nvme_host_mem_buf_desc *host_mem_descs;
void **host_mem_desc_bufs;
+
+   struct dev_coredumpm_bulk_data *dumps;
+   int num_dumps;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -2867,6 +2871,426 @@ static int nvme_resume(struct device *dev)
 
 static SIMPLE_DEV_PM_OPS(nvme_dev_pm_ops, nvme_suspend, nvme_resume);
 
+#ifdef CONFIG_DEV_COREDUMP
+
+static ssize_t nvme_coredump_read(char *buffer, loff_t offset, size_t count,
+ void *data, size_t datalen)
+{
+   return memory_read_from_buffer(buffer, count, &offset, data, datalen);
+}
+
+static void nvme_coredump_free(void *data)
+{
+   kvfree(data);
+}
+
+static int nvme_coredump_empty(struct dev_coredumpm_bulk_data *data)
+{
+   data->name = kstrdup("data", GFP_KERNEL);
+   if (!data->name)
+   return -ENOMEM;
+
+   data->data = NULL;
+   data->datalen = 0;
+   data->read = nvme_coredump_read;
+   data->free = nvme_coredump_free;
+
+   return 0;
+}
+
+static int nvme_coredump_regs(struct dev_coredumpm_bulk_data *data,
+ struct nvme_ctrl *ctrl)
+{
+   const int reg_size = 0x50; /* 00h to 4Fh */
+
+   data->name = kstrdup("regs", GFP_KERNEL);
+   if (!data->name)
+   return -ENOMEM;
+
+   data->data = kvzalloc(reg_size, GFP_KERNEL);
+   if (!data->data) {
+   kfree(data->name);
+   return -ENOMEM;
+   }
+   memcpy_fromio(data->data, to_nvme_dev(ctrl)->bar, reg_size);
+
+   data->datalen = reg_size;
+   data->read = nvme_coredump_read;
+   data->free = nvme_coredump_free;
+
+   return 0;
+}
+
+static void *kvmemdup(const void *src, size_t len, gfp_t gfp)
+{
+   void *p;
+
+   p = kvmalloc(len, gfp);
+   if (p)
+   memcpy(p, src, len);
+
+   return p;
+}
+
+static int nvme_coredump_queues(struct dev_coredumpm_bulk_data *bulk_data,
+   struct nvme_ctrl *ctrl)
+{
+   int i;
+
+   for (i = 0; i < ctrl->queue_count; i++) {
+   struct dev_coredumpm_bulk_data *data = &bulk_data[2 * i];
+   struct nvme_queue *nvmeq = &to_nvme_dev(ctrl)->queues[i];
+
+   data[0].name = kasprintf(GFP_KERNEL, "sq%d", i);
+   data[0].data = kvmemdup(nvmeq->sq_cmds,
+

Re: [PATCH 0/4] nvme-pci: support device coredump

2019-05-04 Thread Akinobu Mita

2019年5月4日(土) 18:40 Minwoo Im :
>
> Hi Akinobu,
>
> On 5/4/19 1:20 PM, Akinobu Mita wrote:
> > 2019年5月3日(金) 21:20 Christoph Hellwig :
> >>
> >> On Fri, May 03, 2019 at 06:12:32AM -0600, Keith Busch wrote:
> >>> Could you actually explain how the rest is useful? I personally have
> >>> never encountered an issue where knowing these values would have helped:
> >>> every device timeout always needed device specific internal firmware
> >>> logs in my experience.
> >
> > I agree that the device specific internal logs like telemetry are the most
> > useful.  The memory dump of command queues and completion queues is not
> > that powerful, but it helps to know what commands were submitted before
> > the controller went wrong (IOW, it's sometimes not enough to know
> > which commands actually failed), and it can be parsed without vendor
> > specific knowledge.
>
> I'm not sure I can say that the memory dump of the queues is useless at all.
>
> As you mentioned, sometimes it's not enough to know which command has
> actually failed because we might want to know what happened before and
> after the actual failure.
>
> But the information about commands handled inside the device would be much
> more useful to figure out what happened because, in the case of multiple
> queues, the arbitration among them cannot be represented by this memory dump.

Correct.

> > If the issue is reproducible, the nvme trace is the most powerful for this
> > kind of information.  The memory dump of the queues is not that powerful,
> > but it can always be enabled by default.
>
> If the memory dump is a key to reproducing some issues, then it will be
> powerful to hand it to a vendor to solve it.  But I'm afraid it might not
> be able to give the relative submission times among the commands in queues.

I agree that the memory dump of the queues alone doesn't help much to
reproduce issues.  However, when analyzing customer-side issues, we would
like to know whether unusual commands were issued before the crash,
especially on the admin queue.
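
To illustrate what "parsed without vendor specific knowledge" could look
like, here is a minimal userspace sketch that lists the opcode and command
identifier of each entry in a raw submission queue dump.  It assumes the
'sq' file is a plain copy of the queue memory with the default 64-byte
entry size; sq_dump_summarize() and struct sqe_summary are hypothetical
names, not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define NVME_SQE_SIZE 64 /* assumed entry size (the common default) */

/* Opcode and command identifier taken from CDW0 of one entry. */
struct sqe_summary {
	uint8_t opcode;
	uint16_t cid;
};

/*
 * Decode up to max entries from a raw submission queue dump.
 * Returns the number of entries decoded.
 */
static size_t sq_dump_summarize(const uint8_t *dump, size_t len,
				struct sqe_summary *out, size_t max)
{
	size_t n = len / NVME_SQE_SIZE;
	size_t i;

	if (n > max)
		n = max;

	for (i = 0; i < n; i++) {
		const uint8_t *e = dump + i * NVME_SQE_SIZE;

		out[i].opcode = e[0];                        /* CDW0 bits 7:0 */
		out[i].cid = (uint16_t)(e[2] | (e[3] << 8)); /* CDW0 bits 31:16 */
	}

	return n;
}
```

Note that the dump does not record the head and tail positions, so this
only shows which commands are present in the ring, not their order of
submission.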

> >> Yes.  Also note that NVMe now has the 'device initiated telemetry'
> >> feature, which is just a weird name for device coredump.  Wiring that
> >> up so that we can easily provide that data to the device vendor would
> >> actually be pretty useful.
> >
> > This version of nvme coredump captures controller registers and each queue.
> > So just before resetting the controller is a suitable time to capture these.
> > If we capture other log pages with this mechanism, the coredump procedure
> > will be split into two phases (before resetting the controller, and after
> > resetting as soon as the admin queue is available).
>
> I agree that it would be nice to have some information, even if it is not
> that powerful, rather than nothing.
>
> But could we request the controller-initiated telemetry log page, if
> supported by the controller, to get the internal information at the point
> of failure, such as a reset?  If the dump is generated with the telemetry
> log page, I think it would be a great clue for solving the issue.

OK.  Let me try it in the next version.

Re: [PATCH 3/4] nvme-pci: add device coredump support

2019-05-04 Thread Akinobu Mita

2019年5月4日(土) 19:04 Minwoo Im :
>
> Hi, Akinobu,
>
> Regardless of the reply to the cover letter, a few nits here.
>
> On 5/2/19 5:59 PM, Akinobu Mita wrote:
> > +
> > +static const struct nvme_reg nvme_regs[] = {
> > + { NVME_REG_CAP,     "cap",      64 },
> > + { NVME_REG_VS,      "version",  32 },
>
> Why don't we just go with "vs" instead of full name of it just like
> the others.

I tried to imitate the output of 'nvme show-regs'.

> > + { NVME_REG_INTMS,   "intms",    32 },
> > + { NVME_REG_INTMC,   "intmc",    32 },
> > + { NVME_REG_CC,      "cc",       32 },
> > + { NVME_REG_CSTS,    "csts",     32 },
> > + { NVME_REG_NSSR,    "nssr",     32 },
> > + { NVME_REG_AQA,     "aqa",      32 },
> > + { NVME_REG_ASQ,     "asq",      64 },
> > + { NVME_REG_ACQ,     "acq",      64 },
> > + { NVME_REG_CMBLOC,  "cmbloc",   32 },
> > + { NVME_REG_CMBSZ,   "cmbsz",    32 },
>
> If it's going to support optional registers also, then we can have
> BP-related things (BPINFO, BPRSEL, BPMBL) here also.

I'm going to change the register dump to a binary format, just like
'nvme show-regs -o binary' does.  So we'll have the registers from 00h to 4Fh.
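
For what it's worth, a binary dump of 00h to 4Fh is easy to decode on the
userspace side.  The sketch below is only an illustration: it assumes the
'regs' file holds the registers in little-endian byte order at their
specification offsets, and nvme_regs_decode() and struct nvme_regs_view
are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define NVME_REGS_DUMP_SIZE 0x50 /* 00h to 4Fh */

/* Read little-endian values from a byte buffer. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t get_le64(const uint8_t *p)
{
	return (uint64_t)get_le32(p) | ((uint64_t)get_le32(p + 4) << 32);
}

/* A subset of the controller registers, enough for a first look. */
struct nvme_regs_view {
	uint64_t cap;
	uint32_t vs, cc, csts, aqa;
	uint64_t asq, acq;
};

/* Returns 0 on success, -1 if the dump is too short. */
static int nvme_regs_decode(const uint8_t *dump, size_t len,
			    struct nvme_regs_view *v)
{
	if (len < NVME_REGS_DUMP_SIZE)
		return -1;

	v->cap  = get_le64(dump + 0x00);
	v->vs   = get_le32(dump + 0x08);
	v->cc   = get_le32(dump + 0x14);
	v->csts = get_le32(dump + 0x1c);
	v->aqa  = get_le32(dump + 0x24);
	v->asq  = get_le64(dump + 0x28);
	v->acq  = get_le64(dump + 0x30);

	return 0;
}
```

The optional registers (CMBLOC, CMBSZ and the boot partition registers)
sit in the same 00h to 4Fh window, so they could be added to the view in
the same way.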

Re: [PATCH 0/4] nvme-pci: support device coredump

2019-05-03 Thread Akinobu Mita

2019年5月3日(金) 21:20 Christoph Hellwig :
>
> On Fri, May 03, 2019 at 06:12:32AM -0600, Keith Busch wrote:
> > Could you actually explain how the rest is useful? I personally have
> > never encountered an issue where knowing these values would have helped:
> > every device timeout always needed device specific internal firmware
> > logs in my experience.

I agree that the device specific internal logs like telemetry are the most
useful.  The memory dump of command queues and completion queues is not
that powerful, but it helps to know what commands were submitted before
the controller went wrong (IOW, it's sometimes not enough to know
which commands actually failed), and it can be parsed without vendor
specific knowledge.

If the issue is reproducible, the nvme trace is the most powerful for this
kind of information.  The memory dump of the queues is not that powerful,
but it can always be enabled by default.

> Yes.  Also note that NVMe now has the 'device initiated telemetry'
> feature, which is just a weird name for device coredump.  Wiring that
> up so that we can easily provide that data to the device vendor would
> actually be pretty useful.

This version of nvme coredump captures controller registers and each queue.
So just before resetting the controller is a suitable time to capture these.
If we capture other log pages with this mechanism, the coredump procedure
will be split into two phases (before resetting the controller, and after
resetting as soon as the admin queue is available).

Re: [PATCH 2/4] devcoredump: allow to create several coredump files in one device

2019-05-02 Thread Akinobu Mita

2019年5月2日(木) 21:47 Johannes Berg :
>
> On Thu, 2019-05-02 at 17:59 +0900, Akinobu Mita wrote:
> >
> >  static void devcd_del(struct work_struct *wk)
> >  {
> >   struct devcd_entry *devcd;
> > + int i;
> >
> >   devcd = container_of(wk, struct devcd_entry, del_wk.work);
> >
> > + for (i = 0; i < devcd->num_files; i++) {
> > + device_remove_bin_file(&devcd->devcd_dev,
> > +                        &devcd->files[i].bin_attr);
> > + }
>
> Not much value in the braces?

OK.  I tend to use braces when a single statement spans multiple lines.

> > +static struct devcd_entry *devcd_alloc(struct dev_coredumpm_bulk_data *files,
> > +int num_files, gfp_t gfp)
> > +{
> > + struct devcd_entry *devcd;
> > + int i;
> > +
> > + devcd = kzalloc(sizeof(*devcd), gfp);
> > + if (!devcd)
> > + return NULL;
> > +
> > + devcd->files = kcalloc(num_files, sizeof(devcd->files[0]), gfp);
> > + if (!devcd->files) {
> > + kfree(devcd);
> > + return NULL;
> > + }
> > + devcd->num_files = num_files;
>
> IMHO it would be nicer to allocate all of this in one struct, i.e. have
>
> struct devcd_entry {
> ...
> struct devcd_file files[];
> }
>
> (and then use struct_size())

Sounds good.
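
For reference, the single-allocation layout suggested above can be
sketched in plain userspace C as follows.  This is only an illustration of
the flexible array member pattern: struct entry, struct file_slot and
entry_alloc() are hypothetical stand-ins, and the size is computed by hand
where kernel code would use struct_size(), which also guards against
overflow.

```c
#include <stdlib.h>

struct file_slot {
	const char *name;
	size_t size;
};

struct entry {
	int num_files;
	struct file_slot files[]; /* flexible array member */
};

/* Allocate the header and all slots in one zeroed block. */
static struct entry *entry_alloc(int num_files)
{
	struct entry *e;

	if (num_files < 0)
		return NULL;

	e = calloc(1, sizeof(*e) + (size_t)num_files * sizeof(e->files[0]));
	if (!e)
		return NULL;

	e->num_files = num_files;
	return e;
}
```

With this layout a single free() releases everything, so the error path
no longer needs to unwind a second allocation.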

> > @@ -309,7 +339,41 @@ void dev_coredumpm(struct device *dev, struct module *owner,
> >   put_module:
> >   module_put(owner);
> >   free:
> > - free(data);
> > + for (i = 0; i < num_files; i++)
> > + files[i].free(files[i].data);
> > +}
>
> and then you don't need to do all this kind of thing to free
>
> Otherwise looks fine. I'd worry a bit that existing userspace will only
> capture the 'data' file, rather than a tarball of all files, but I guess
> that's something you'd have to work out then when actually desiring to
> use multiple files.

Your concern is valid.  I'm going to create an empty 'data' file for the nvme
coredump.  Assuming that devcd* always contains at least the 'data' file,
we can simply write to 'data' when the device coredump is no longer needed,
and be ready for a newer coredump.
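
As a rough illustration of that userspace step, the sketch below writes a
single byte to the 'data' file of one coredump directory.  The helper name
devcd_dismiss() is hypothetical, and the directory argument is assumed to
be something like a devcdN directory under the devcoredump class in sysfs.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write one byte to <devcd_dir>/data to signal that the dump has been
 * collected.  Returns 0 on success, -1 on error.
 */
static int devcd_dismiss(const char *devcd_dir)
{
	char path[4096];
	ssize_t w;
	int fd, n;

	n = snprintf(path, sizeof(path), "%s/data", devcd_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	w = write(fd, "1", 1);
	close(fd);

	return w == 1 ? 0 : -1;
}
```

A collector would first copy every file in the directory (regs, sq, cq,
and so on) and only then call this, since the dump goes away after the
write.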

Re: [PATCH 0/4] nvme-pci: support device coredump

2019-05-02 Thread Akinobu Mita

2019年5月2日(木) 22:03 Keith Busch :
>
> On Thu, May 02, 2019 at 05:59:17PM +0900, Akinobu Mita wrote:
> > This enables capturing a snapshot of controller information via the
> > device coredump mechanism, which helps diagnose and debug issues.
> >
> > The nvme device coredump is triggered before resetting the controller
> > caused by I/O timeout, and creates the following coredump files.
> >
> > - regs: NVMe controller registers, including each I/O queue doorbell
> > registers, in nvme-show-regs style text format.
>
> You're supposed to treat queue doorbells as write-only. Spec says:
>
>   The host should not read the doorbell registers. If a doorbell register
>   is read, the value returned is vendor specific.

OK.  I'll exclude the doorbell registers from the register dump.  We can do
without that information if we have a snapshot of the queues.

[PATCH 4/4] nvme-pci: trigger device coredump before resetting controller

2019-05-02 Thread Akinobu Mita

This enables the nvme driver to trigger a device coredump before resetting
the controller due to an I/O timeout.

The device coredump helps diagnose and debug issues.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Signed-off-by: Akinobu Mita 
---
 drivers/nvme/host/pci.c | 31 ++-
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7f3077c..584c2aa 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -87,7 +87,7 @@ MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");
 struct nvme_dev;
 struct nvme_queue;
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool dump);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
 
 /*
@@ -1286,7 +1286,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 */
if (nvme_should_reset(dev, csts)) {
nvme_warn_reset(dev, csts);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
return BLK_EH_DONE;
}
@@ -1313,7 +1313,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
dev_warn_ratelimited(dev->ctrl.device,
 "I/O %d QID %d timeout, disable controller\n",
 req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_DONE;
default:
@@ -1329,7 +1329,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
dev_warn(dev->ctrl.device,
 "I/O %d QID %d timeout, reset controller\n",
 req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
 
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
@@ -2396,7 +2396,9 @@ static void nvme_pci_disable(struct nvme_dev *dev)
}
 }
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
+static void nvme_coredump(struct device *dev);
+
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool dump)
 {
bool dead = true;
struct pci_dev *pdev = to_pci_dev(dev->dev);
@@ -2421,6 +2423,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
}
 
+   if (dump)
+   nvme_coredump(dev->dev);
+
nvme_stop_queues(&dev->ctrl);
 
if (!dead && dev->ctrl.queue_count > 0) {
@@ -2488,7 +2493,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
 
nvme_get_ctrl(&dev->ctrl);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
nvme_kill_queues(&dev->ctrl);
if (!queue_work(nvme_wq, &dev->remove_work))
nvme_put_ctrl(&dev->ctrl);
@@ -2510,7 +2515,7 @@ static void nvme_reset_work(struct work_struct *work)
 * moving on.
 */
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
 
mutex_lock(&dev->shutdown_lock);
result = nvme_pci_enable(dev);
@@ -2799,7 +2804,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 static void nvme_reset_prepare(struct pci_dev *pdev)
 {
struct nvme_dev *dev = pci_get_drvdata(pdev);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
 }
 
 static void nvme_reset_done(struct pci_dev *pdev)
@@ -2811,7 +2816,7 @@ static void nvme_reset_done(struct pci_dev *pdev)
 static void nvme_shutdown(struct pci_dev *pdev)
 {
struct nvme_dev *dev = pci_get_drvdata(pdev);
-   nvme_dev_disable(dev, true);
+   nvme_dev_disable(dev, true, false);
 }
 
 /*
@@ -2828,14 +2833,14 @@ static void nvme_remove(struct pci_dev *pdev)
 
if (!pci_device_is_present(pdev)) {
nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
-   nvme_dev_disable(dev, true);
+   nvme_dev_disable(dev, true, false);
nvme_dev_remove_admin(dev);
}
 
flush_work(&dev->ctrl.reset_work);
nvme_stop_ctrl(&dev->ctrl);
nvme_remove_namespaces(&dev->ctrl);
-   nvme_dev_disable(dev, true);
+   nvme_dev_disable(dev, true, false);
nvme_release_cmb(d

[PATCH 1/4] devcoredump: use memory_read_from_buffer

2019-05-02 Thread Akinobu Mita

Use memory_read_from_buffer() to simplify devcd_readv().

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Signed-off-by: Akinobu Mita 
---
 drivers/base/devcoredump.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index f1a3353..3c960a6 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -164,16 +164,7 @@ static struct class devcd_class = {
 static ssize_t devcd_readv(char *buffer, loff_t offset, size_t count,
   void *data, size_t datalen)
 {
-   if (offset > datalen)
-   return -EINVAL;
-
-   if (offset + count > datalen)
-   count = datalen - offset;
-
-   if (count)
-   memcpy(buffer, ((u8 *)data) + offset, count);
-
-   return count;
+   return memory_read_from_buffer(buffer, count, &offset, data, datalen);
 }
 
 static void devcd_freev(void *data)
-- 
2.7.4
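
For readers comparing the two versions, here is a userspace model of the
bounded-copy behaviour that the helper provides, as I understand it.
read_from_buffer_model() is a hypothetical stand-in written for this
note, not the kernel function itself.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/*
 * Copy at most count bytes from 'from' (available bytes long) starting at
 * *ppos, clamp at the end of the buffer, and advance *ppos.
 * Returns the number of bytes copied, 0 at or past the end, or -EINVAL
 * for a negative position.
 */
static ssize_t read_from_buffer_model(void *to, size_t count, off_t *ppos,
				      const void *from, size_t available)
{
	off_t pos = *ppos;

	if (pos < 0)
		return -EINVAL;
	if ((size_t)pos >= available)
		return 0;
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;

	memcpy(to, (const char *)from + pos, count);
	*ppos = pos + (off_t)count;

	return (ssize_t)count;
}
```

One difference from the removed open-coded version is visible here: in
this model an offset past the end of the data yields 0 rather than
-EINVAL.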

[PATCH 3/4] nvme-pci: add device coredump support

2019-05-02 Thread Akinobu Mita

This enables capturing a snapshot of controller information via the device
coredump mechanism.

The nvme device coredump creates the following coredump files.

- regs: NVMe controller registers, including each I/O queue doorbell
registers, in nvme-show-regs style text format.

- sq: I/O submission queue

- cq: I/O completion queue

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Signed-off-by: Akinobu Mita 
---
 drivers/nvme/host/Kconfig |   1 +
 drivers/nvme/host/pci.c   | 221 ++
 2 files changed, 222 insertions(+)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 0f345e2..c3a06af 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -5,6 +5,7 @@ config BLK_DEV_NVME
tristate "NVM Express block device"
depends on PCI && BLOCK
select NVME_CORE
+   select WANT_DEV_COREDUMP
---help---
  The NVM Express driver is for solid state drives directly
  connected to the PCI or PCI Express bus.  If you know you
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a90cf5d..7f3077c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2867,6 +2868,225 @@ static int nvme_resume(struct device *dev)
 
 static SIMPLE_DEV_PM_OPS(nvme_dev_pm_ops, nvme_suspend, nvme_resume);
 
+#ifdef CONFIG_DEV_COREDUMP
+
+struct nvme_reg {
+   u32 off;
+   const char *name;
+   int bits;
+};
+
+static const struct nvme_reg nvme_regs[] = {
+   { NVME_REG_CAP,     "cap",      64 },
+   { NVME_REG_VS,      "version",  32 },
+   { NVME_REG_INTMS,   "intms",    32 },
+   { NVME_REG_INTMC,   "intmc",    32 },
+   { NVME_REG_CC,      "cc",       32 },
+   { NVME_REG_CSTS,    "csts",     32 },
+   { NVME_REG_NSSR,    "nssr",     32 },
+   { NVME_REG_AQA,     "aqa",      32 },
+   { NVME_REG_ASQ,     "asq",      64 },
+   { NVME_REG_ACQ,     "acq",      64 },
+   { NVME_REG_CMBLOC,  "cmbloc",   32 },
+   { NVME_REG_CMBSZ,   "cmbsz",    32 },
+};
+
+static int nvme_coredump_regs_padding(int num_queues)
+{
+   char name[16];
+   int padding;
+   int i;
+
+   padding = sprintf(name, "sq%dtdbl", num_queues - 1);
+
+   for (i = 0; i < ARRAY_SIZE(nvme_regs); i++)
+   padding = max_t(int, padding, strlen(nvme_regs[i].name));
+
+   return padding;
+}
+
+static int nvme_coredump_regs_buf_size(int num_queues, int padding)
+{
+   int line_size = padding + strlen(" : \n");
+   int buf_size;
+
+   /* Max print buffer size for controller registers */
+   buf_size = line_size * ARRAY_SIZE(nvme_regs);
+
+   /* Max print buffer size for SQyTDBL and CQxHDBL registers */
+   buf_size += line_size * num_queues * 2;
+
+   return buf_size;
+}
+
+static int nvme_coredump_regs_print(void *buf, int buf_size,
+   struct nvme_ctrl *ctrl, int padding)
+{
+   struct nvme_dev *dev = to_nvme_dev(ctrl);
+   int len = 0;
+   int i;
+
+   for (i = 0; i < ARRAY_SIZE(nvme_regs); i++) {
+   const struct nvme_reg *reg = &nvme_regs[i];
+   u64 val;
+
+   if (reg->bits == 32)
+   val = readl(dev->bar + reg->off);
+   else
+   val = readq(dev->bar + reg->off);
+
+   len += snprintf(buf + len, buf_size - len, "%-*s : %llx\n",
+   padding, reg->name, val);
+   }
+
+   for (i = 0; i < ctrl->queue_count; i++) {
+   struct nvme_queue *nvmeq = &dev->queues[i];
+   char name[16];
+
+   sprintf(name, "sq%dtdbl", i);
+   len += snprintf(buf + len, buf_size - len, "%-*s : %x\n",
+   padding, name, readl(nvmeq->q_db));
+
+   sprintf(name, "cq%dhdbl", i);
+   len += snprintf(buf + len, buf_size - len, "%-*s : %x\n",
+   padding, name,
+   readl(nvmeq->q_db + dev->db_stride));
+   }
+
+   return len;
+}
+
+static ssize_t nvme_coredump_read(char *buffer, loff_t offset, size_t count,
+ void *data, size_t datalen)
+{
+   return memory_read_from_buffer(buffer, count, &offset, data, datalen);
+}
+
+static void nvme_coredump_free(void *data)
+{
+   kvfree(data);
+}
+
+static int nvme_coredump_regs(struct dev_coredumpm_bulk_data *data,
+

[PATCH 2/4] devcoredump: allow to create several coredump files in one device

2019-05-02 Thread Akinobu Mita

The device coredump mechanism currently allows drivers to create only a
single coredump file.  If there are several binary blobs to dump, we need
to define a binary format or convert them to a text format in order to put
them into a single coredump file.

This provides a new function that allows drivers to create several device
coredump files for one crashed device.

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Signed-off-by: Akinobu Mita 
---
 drivers/base/devcoredump.c  | 162 ++--
 include/linux/devcoredump.h |  33 +
 2 files changed, 146 insertions(+), 49 deletions(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 3c960a6..30ddc5e 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -25,14 +25,18 @@ static bool devcd_disabled;
 /* if data isn't read by userspace after 5 minutes then delete it */
 #define DEVCD_TIMEOUT  (HZ * 60 * 5)
 
-struct devcd_entry {
-   struct device devcd_dev;
-   void *data;
-   size_t datalen;
-   struct module *owner;
+struct devcd_file {
+   struct bin_attribute bin_attr;
ssize_t (*read)(char *buffer, loff_t offset, size_t count,
void *data, size_t datalen);
void (*free)(void *data);
+};
+
+struct devcd_entry {
+   struct device devcd_dev;
+   struct devcd_file *files;
+   int num_files;
+   struct module *owner;
struct delayed_work del_wk;
struct device *failing_dev;
 };
@@ -45,8 +49,15 @@ static struct devcd_entry *dev_to_devcd(struct device *dev)
 static void devcd_dev_release(struct device *dev)
 {
struct devcd_entry *devcd = dev_to_devcd(dev);
+   int i;
+
+   for (i = 0; i < devcd->num_files; i++) {
+   struct devcd_file *file = &devcd->files[i];
+
+   file->free(file->bin_attr.private);
+   }
+   kfree(devcd->files);
 
-   devcd->free(devcd->data);
module_put(devcd->owner);
 
/*
@@ -64,9 +75,15 @@ static void devcd_dev_release(struct device *dev)
 static void devcd_del(struct work_struct *wk)
 {
struct devcd_entry *devcd;
+   int i;
 
devcd = container_of(wk, struct devcd_entry, del_wk.work);
 
+   for (i = 0; i < devcd->num_files; i++) {
+   device_remove_bin_file(&devcd->devcd_dev,
+  &devcd->files[i].bin_attr);
+   }
+
device_del(&devcd->devcd_dev);
put_device(&devcd->devcd_dev);
 }
@@ -75,10 +92,11 @@ static ssize_t devcd_data_read(struct file *filp, struct kobject *kobj,
   struct bin_attribute *bin_attr,
   char *buffer, loff_t offset, size_t count)
 {
-   struct device *dev = kobj_to_dev(kobj);
-   struct devcd_entry *devcd = dev_to_devcd(dev);
+   struct devcd_file *file =
+   container_of(bin_attr, struct devcd_file, bin_attr);
 
-   return devcd->read(buffer, offset, count, devcd->data, devcd->datalen);
+   return file->read(buffer, offset, count, bin_attr->private,
+ bin_attr->size);
 }
 
 static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj,
@@ -93,25 +111,6 @@ static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj,
return count;
 }
 
-static struct bin_attribute devcd_attr_data = {
-   .attr = { .name = "data", .mode = S_IRUSR | S_IWUSR, },
-   .size = 0,
-   .read = devcd_data_read,
-   .write = devcd_data_write,
-};
-
-static struct bin_attribute *devcd_dev_bin_attrs[] = {
-   &devcd_attr_data, NULL,
-};
-
-static const struct attribute_group devcd_dev_group = {
-   .bin_attrs = devcd_dev_bin_attrs,
-};
-
-static const struct attribute_group *devcd_dev_groups[] = {
-   &devcd_dev_group, NULL,
-};
-
 static int devcd_free(struct device *dev, void *data)
 {
struct devcd_entry *devcd = dev_to_devcd(dev);
@@ -157,7 +156,6 @@ static struct class devcd_class = {
.name   = "devcoredump",
.owner  = THIS_MODULE,
.dev_release= devcd_dev_release,
-   .dev_groups = devcd_dev_groups,
.class_groups   = devcd_class_groups,
 };
 
@@ -234,30 +232,60 @@ static ssize_t devcd_read_from_sgtable(char *buffer, loff_t offset,
  offset);
 }
 
+static struct devcd_entry *devcd_alloc(struct dev_coredumpm_bulk_data *files,
+  int num_files, gfp_t gfp)
+{
+   struct devcd_entry *devcd;
+   int i;
+
+   devcd = kzalloc(sizeof(*devcd), gfp);
+   if (!devcd)
+   return NULL;
+
+   devcd->files = kcalloc(num_files, sizeof(devcd->files[0]), gfp);
+   if (!devcd->files) {
+   kfree(devcd);
+   return NULL;
+   }
+   devcd->num_files = num_files;
+
+

[PATCH 0/4] nvme-pci: support device coredump

2019-05-02 Thread Akinobu Mita

This enables capturing a snapshot of controller information via the device
coredump mechanism, which helps diagnose and debug issues.

The nvme device coredump is triggered before resetting the controller
due to an I/O timeout, and creates the following coredump files.

- regs: NVMe controller registers, including each I/O queue doorbell
registers, in nvme-show-regs style text format.

- sq: I/O submission queue

- cq: I/O completion queue

The device coredump mechanism currently allows drivers to create only a
single coredump file, so this also provides a new function that allows
drivers to create several device coredump files in one crashed device.

Akinobu Mita (4):
  devcoredump: use memory_read_from_buffer
  devcoredump: allow to create several coredump files in one device
  nvme-pci: add device coredump support
  nvme-pci: trigger device coredump before resetting controller

 drivers/base/devcoredump.c  | 173 +++---
 drivers/nvme/host/Kconfig   |   1 +
 drivers/nvme/host/pci.c | 252 +---
 include/linux/devcoredump.h |  33 ++
 4 files changed, 387 insertions(+), 72 deletions(-)

Cc: Johannes Berg 
Cc: Keith Busch 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
-- 
2.7.4

[PATCH] docs: fix typo in table describing 4.16 development cycle

2018-08-17 Thread Akinobu Mita

Fix s/4.17/4.16/ typo.

Fixes: 8962e40c1993 ("docs: update kernel versions and dates in tables")
Cc: Tim Bird 
Cc: Jonathan Corbet 
Signed-off-by: Akinobu Mita 
---
 Documentation/process/2.Process.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/process/2.Process.rst b/Documentation/process/2.Process.rst
index 51d0349..ae020d8 100644
--- a/Documentation/process/2.Process.rst
+++ b/Documentation/process/2.Process.rst
@@ -82,7 +82,7 @@ As an example, here is how the 4.16 development cycle went (all dates in
    March 11        4.16-rc5
    March 18        4.16-rc6
    March 25        4.16-rc7
-   April 1         4.17 stable release
+   April 1         4.16 stable release
    ==============  ===============================
 
 How do the developers decide when to close the development cycle and create
-- 
2.7.4

Re: Nokia N900: v4.16-rc4: oops in iio when grepping sysfs

2018-03-10 Thread Akinobu Mita

2018-03-10 8:01 GMT+09:00 Pavel Machek :
> Hi!
>
>> > Hmm. Looks like there's a lot of fun to be had with sysfs.
>> >
>> >
>> > pavel@n900:~$ uname -a
>> > Linux n900 4.16.0-rc4-59690-g7f84626-dirty #543 Thu Mar 8 19:53:30 CET
>> > 2018 armv7l GNU/Linux
>> >
>> > [  306.402496] bq2415x: command Timer reset
>> > [  312.761322] adp1653 2-0030: Read Addr:03 Val:00 ok
>> > [  313.264129] Unable to handle kernel NULL pointer dereference at
>> > virtual address 0
>> > 000
>> > [  313.272308] pgd = 01336620
>> > [  313.275146] [] *pgd=800af831, *pte=, *ppte=
>> > [  313.281463] Internal error: Oops: 8007 [#1] ARM
>> > [  313.286376] Modules linked in:
>> > [  313.289459] CPU: 0 PID: 3584 Comm: grep Tainted: GW
>> > 4.16.0-rc4-59690-g
>> > 7f84626-dirty #543
>> > [  313.298919] Hardware name: Nokia RX-51 board
>> > [  313.303222] PC is at   (null)
>> > [  313.306213] LR is at iio_ev_state_show+0x38/0x54
>> > [  313.310852] pc : [<>]lr : []psr: a013
>> > [  313.317169] sp : c7b47e70  ip : c087bb24  fp : 0001
>> > [  313.322418] r10: cb19e000  r9 : c0857220  r8 : 1000
>> > [  313.327667] r7 : 0fff  r6 : cb711c80  r5 : cb19e000  r4 :
>> > 
>> > [  313.334228] r3 : 0001  r2 :   r1 : c087b4dc  r0 :
>> > ce584800
>> > [  313.340789] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
>> > Segment none
>> > [  313.347991] Control: 10c5387d  Table: 87bec019  DAC: 0051
>> > [  313.353759] Process grep (pid: 3584, stack limit = 0xc4e45eab)
>> > [  313.359619] Stack: (0xc7b47e70 to 0xc7b48000)
>> > [  313.364013] 7e60: c05d6988
>> > cb711b00 ce585c00 c0450d68
>> > [  313.471038] [] (iio_ev_state_show) from []
>> > (dev_attr_show+0x1c/0x4c)
>> > [  313.479187] [] (dev_attr_show) from []
>> > (sysfs_kf_seq_show+0x90/0x108)
>> > [  313.487426] [] (sysfs_kf_seq_show) from []
>> > (kernfs_seq_show+0x24/0x28)
>> > [  313.495758] [] (kernfs_seq_show) from []
>> > (seq_read+0x1dc/0x500)
>> > [  313.503479] [] (seq_read) from []
>> > (__vfs_read+0x2c/0x120)
>> > [  313.510681] [] (__vfs_read) from []
>> > (vfs_read+0x88/0x114)
>> > [  313.517852] [] (vfs_read) from []
>> > (SyS_read+0x40/0x8c)
>> > [  313.524780] [] (SyS_read) from []
>> > (ret_fast_syscall+0x0/0x54)
>> > [  313.532318] Exception stack(0xc7b47fa8 to 0xc7b47ff0)
>> > [  313.537414] 7fa0:   00035330 00042000 0003
>> > 00042000 8000 8000
>> >
>>
>> What file are you opening to cause this?
>
> Strace says:
>
> openat(7, "in_intensity_both_thresh_rising_en",
>> > O_RDONLY|O_LARGEFILE|O_NOFOLLOW) = 3
> fstat64(3, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
> ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbe83b714) = -1 ENOTTY
>> > (Inappropriate ioctl for device)
> read(3,
> Message from syslogd@localhost at Mar  9 23:54:39 ...
>  kernel:[ 3097.357696] Internal error: Oops: 8007 [#2] ARM
>
> So that would be:
>
> ./devices/platform/6800.ocp/48072000.i2c/i2c-2/2-0029/iio:device1/events/in_intensity_both_thresh_rising_en
>
> And indeed, manually cat-ing that file reproduces the problem.

This problem happens when no irq is defined for this device.

In this case, tsl2563_info_no_irq whose read_event_config field is NULL
is selected as iio_info.  On the other hand, iio_chan_spec for this
driver always registers event_spec.

So sysfs files related to the channel events are always created even if
no irq is defined.

I think we can fix this issue by defining another iio_chan_spec with
no event_spec for no irq case.

Jonathan, do you have any other ideas on how to fix this issue?
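
To make the idea concrete, here is a simplified userspace model of what I
have in mind, not driver code.  All type and variable names below are
hypothetical stand-ins for the real IIO structures: there are two channel
tables, and the one without event specs is picked when there is no irq, so
no event attributes would be created for it.

```c
#include <stddef.h>

/* Simplified stand-ins for the event and channel descriptions. */
struct event_spec_model {
	int type;
};

struct chan_spec_model {
	const char *name;
	const struct event_spec_model *event_spec;
	size_t num_event_specs;
};

static const struct event_spec_model thresh_events[] = { { 1 }, { 2 } };

/* Channels advertised when an interrupt line is available. */
static const struct chan_spec_model channels_with_irq[] = {
	{ "intensity_both", thresh_events, 2 },
	{ "intensity_ir", NULL, 0 },
};

/* Same channels, but without any event specs. */
static const struct chan_spec_model channels_no_irq[] = {
	{ "intensity_both", NULL, 0 },
	{ "intensity_ir", NULL, 0 },
};

/* Pick the channel table that matches the interrupt configuration. */
static const struct chan_spec_model *select_channels(int irq)
{
	return irq > 0 ? channels_with_irq : channels_no_irq;
}
```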

Re: [PATCH bpf-next v5 5/5] error-injection: Support fault injection framework

2018-01-13 Thread Akinobu Mita

2018-01-13 2:56 GMT+09:00 Masami Hiramatsu :
> Support in-kernel fault-injection framework via debugfs.
> This allows you to inject a conditional error to specified
> function using debugfs interfaces.
>
> Here is the result of test script described in
> Documentation/fault-injection/fault-injection.txt
>
>   ===
>   # ./test_fail_function.sh
>   1+0 records in
>   1+0 records out
>   1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
>   btrfs-progs v4.4
>   See http://btrfs.wiki.kernel.org for more information.
>
>   Label:  (null)
>   UUID:   bfa96010-12e9-4360-aed0-42eec7af5798
>   Node size:  16384
>   Sector size:4096
>   Filesystem size:1001.00MiB
>   Block group profiles:
> Data: single8.00MiB
> Metadata: DUP  58.00MiB
> System:   DUP  12.00MiB
>   SSD detected:   no
>   Incompat features:  extref, skinny-metadata
>   Number of devices:  1
>   Devices:
>  IDSIZE  PATH
>   1  1001.00MiB  /dev/loop2
>
>   mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
>   SUCCESS!
>   ===
>
>
> Signed-off-by: Masami Hiramatsu 
> Reviewed-by: Josef Bacik 
> ---
>   Changes in v3:
>- Check and adjust error value for each target function
>- Clear kporbe flag for reuse
>- Add more documents and example
>   Changes in v5:
>- Support multi-function error injection
> ---
>  Documentation/fault-injection/fault-injection.txt |   68 
>  kernel/Makefile   |1
>  kernel/fail_function.c|  349 
> +
>  lib/Kconfig.debug |   10 +
>  4 files changed, 428 insertions(+)
>  create mode 100644 kernel/fail_function.c
>
> diff --git a/Documentation/fault-injection/fault-injection.txt 
> b/Documentation/fault-injection/fault-injection.txt
> index 918972babcd8..f4a32463ca48 100644
> --- a/Documentation/fault-injection/fault-injection.txt
> +++ b/Documentation/fault-injection/fault-injection.txt
> @@ -30,6 +30,12 @@ o fail_mmc_request
>injects MMC data errors on devices permitted by setting
>debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
>
> +o fail_function
> +
> +  injects error return on specific functions, which are marked by
> +  ALLOW_ERROR_INJECTION() macro, by setting debugfs entries
> +  under /sys/kernel/debug/fail_function. No boot option supported.
> +
>  Configure fault-injection capabilities behavior
>  ---
>
> @@ -123,6 +129,29 @@ configuration of fault-injection capabilities.
> default is 'N', setting it to 'Y' will disable failure injections
> when dealing with private (address space) futexes.
>
> +- /sys/kernel/debug/fail_function/inject:
> +
> +   Format: { 'function-name' | '!function-name' | '' }
> +   specifies the target function of error injection by name.
> +   If the function name leads '!' prefix, given function is
> +   removed from injection list. If nothing specified ('')
> +   injection list is cleared.
> +
> +- /sys/kernel/debug/fail_function/injectable:
> +
> +   (read only) shows error injectable functions and what type of
> +   error values can be specified. The error type will be one of
> +   below;
> +   - NULL: retval must be 0.
> +   - ERRNO: retval must be -1 to -MAX_ERRNO (-4096).
> +   - ERR_NULL: retval must be 0 or -1 to -MAX_ERRNO (-4096).
> +
> +- /sys/kernel/debug/fail_function//retval:
> +
> +   specifies the "error" return value to inject to the given
> +   function for given function. This will be created when
> +   user specifies new injection entry.
> +
>  o Boot option
>
>  In order to inject faults while debugfs is not available (early boot time),
> @@ -268,6 +297,45 @@ trap "echo 0 > /sys/kernel/debug/$FAILTYPE/probability" 
> SIGINT SIGTERM EXIT
>  echo "Injecting errors into the module $module... (interrupt to stop)"
>  sleep 100
>
> +--
> +
> +o Inject open_ctree error while btrfs mount
> +
> +#!/bin/bash
> +
> +rm -f testfile.img
> +dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
> +DEVICE=$(losetup --show -f testfile.img)
> +mkfs.btrfs -f $DEVICE
> +mkdir -p tmpmnt
> +
> +FAILTYPE=fail_function
> +FAILFUNC=open_ctree
> +echo $FAILFUNC > /sys/kernel/debug/$FAILTYPE/inject
> +echo -12 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval
> +echo N > /sys/kernel/debug/$FAILTYPE/task-filter
> +echo 100 > /sys/kernel/debug/$FAILTYPE/probability
> +echo 0 > /sys/kernel/debug/$FAILTYPE/interval
> +echo -1 > /sys/kernel/debug/$FAILTYPE/times
> +echo 0 > /sys/kernel/debug/$FAILTYPE/space
> +echo 1 > /sys/kernel/debug/$FAILTYPE/verbose

I expected that the fault_attr is created for each target function.
(i.e. /sys/kernel/debug/fail_function// directory
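
The three error types listed in the quoted "injectable" description can be
read as a small validity check on retval.  The following is only a userspace
sketch of that rule, not the code from kernel/fail_function.c.  The enum and
function names are invented here, and MAX_ERRNO is taken as 4095, which is my
understanding of include/linux/err.h (the quoted text shows -4096 in
parentheses).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ERRNO 4095	/* assumed value, see the note above */

/* Hypothetical names for the three types shown by "injectable". */
enum err_type {
	ERR_TYPE_NULL,		/* retval must be 0 */
	ERR_TYPE_ERRNO,		/* retval must be -1 to -MAX_ERRNO */
	ERR_TYPE_ERR_NULL,	/* retval must be 0 or -1 to -MAX_ERRNO */
};

static bool retval_is_errno(long retval)
{
	return retval <= -1 && retval >= -MAX_ERRNO;
}

/* Returns true if retval may be injected for a function of this type. */
static bool retval_allowed(enum err_type type, long retval)
{
	switch (type) {
	case ERR_TYPE_NULL:
		return retval == 0;
	case ERR_TYPE_ERRNO:
		return retval_is_errno(retval);
	case ERR_TYPE_ERR_NULL:
		return retval == 0 || retval_is_errno(retval);
	}
	return false;
}
```

For example, the -12 written to retval in the quoted script passes the
ERRNO check, while 0 would be rejected for that type.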

Re: [PATCH bpf-next v4 5/5] error-injection: Support fault injection framework

2018-01-11 Thread Akinobu Mita

2018-01-12 1:15 GMT+09:00 Masami Hiramatsu <mhira...@kernel.org>:
> On Thu, 11 Jan 2018 23:44:57 +0900
> Akinobu Mita <akinobu.m...@gmail.com> wrote:
>
>> 2018-01-11 9:51 GMT+09:00 Masami Hiramatsu <mhira...@kernel.org>:
>> > Support in-kernel fault-injection framework via debugfs.
>> > This allows you to inject a conditional error to specified
>> > function using debugfs interfaces.
>> >
>> > Here is the result of test script described in
>> > Documentation/fault-injection/fault-injection.txt
>> >
>> >   ===
>> >   # ./test_fail_function.sh
>> >   1+0 records in
>> >   1+0 records out
>> >   1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
>> >   btrfs-progs v4.4
>> >   See http://btrfs.wiki.kernel.org for more information.
>> >
>> >   Label:  (null)
>> >   UUID:   bfa96010-12e9-4360-aed0-42eec7af5798
>> >   Node size:  16384
>> >   Sector size:4096
>> >   Filesystem size:1001.00MiB
>> >   Block group profiles:
>> > Data: single8.00MiB
>> > Metadata: DUP  58.00MiB
>> > System:   DUP  12.00MiB
>> >   SSD detected:   no
>> >   Incompat features:  extref, skinny-metadata
>> >   Number of devices:  1
>> >   Devices:
>> >  IDSIZE  PATH
>> >   1  1001.00MiB  /dev/loop2
>> >
>> >   mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
>> >   SUCCESS!
>> >   ===
>> >
>> >
>> > Signed-off-by: Masami Hiramatsu <mhira...@kernel.org>
>> > Reviewed-by: Josef Bacik <jba...@fb.com>
>> > ---
>> >   Changes in v3:
>> >- Check and adjust error value for each target function
>> >- Clear kporbe flag for reuse
>> >- Add more documents and example
>> > ---
>> >  Documentation/fault-injection/fault-injection.txt |   62 ++
>> >  kernel/Makefile   |1
>> >  kernel/fail_function.c|  217 
>> > +
>> >  lib/Kconfig.debug |   10 +
>> >  4 files changed, 290 insertions(+)
>> >  create mode 100644 kernel/fail_function.c
>> >
>> > diff --git a/Documentation/fault-injection/fault-injection.txt 
>> > b/Documentation/fault-injection/fault-injection.txt
>> > index 918972babcd8..4aecbceef9d2 100644
>> > --- a/Documentation/fault-injection/fault-injection.txt
>> > +++ b/Documentation/fault-injection/fault-injection.txt
>> > @@ -30,6 +30,12 @@ o fail_mmc_request
>> >injects MMC data errors on devices permitted by setting
>> >debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
>> >
>> > +o fail_function
>> > +
>> > +  injects error return on specific functions, which are marked by
>> > +  ALLOW_ERROR_INJECTION() macro, by setting debugfs entries
>> > +  under /sys/kernel/debug/fail_function. No boot option supported.
>> > +
>> >  Configure fault-injection capabilities behavior
>> >  ---
>> >
>> > @@ -123,6 +129,24 @@ configuration of fault-injection capabilities.
>> > default is 'N', setting it to 'Y' will disable failure injections
>> > when dealing with private (address space) futexes.
>> >
>> > +- /sys/kernel/debug/fail_function/inject:
>> > +
>> > +   specifies the target function of error injection by name.
>> > +
>> > +- /sys/kernel/debug/fail_function/retval:
>> > +
>> > +   specifies the "error" return value to inject to the given
>> > +   function.
>> > +
>>
>> Is it possible to inject errors into multiple functions at the same time?
>
> Yes, it is.
>
>> If so, it will be more useful to support it in the fault injection, too.
>> Because some kind of bugs are caused by the combination of errors.
>> (e.g. another error in an error path)
>>
>> I suggest the following interface.
>>
>> - /sys/kernel/debug/fail_function/inject:
>>
>>   specifies the target function of error injection by name.
>>   /sys/kernel/debug/fail_function// directory will be created.
>>
>> - /sys/kernel/debug/fail_function/uninject:
>>
>>   specifies the target function of error injection by name that is
>>   currently being injected.  /sys/kernel/debug/fail_function//
>>   directory will be removed.
>>
>> - /sys/kernel/debug/fail_function//retval:
>>
>>   specifies the "error" return value to inject to the given function.
>
> OK, it is easy to make it. But also we might need to consider using bpf
> if we do such complex error injection.
>
> BTW, would we need "uninject" file? or just make inject file accept
> "!function" syntax to remove function as ftrace does?

It also sounds good.  Either way is fine with me.
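
For reference, the "!function" variant could be parsed roughly like this.
It is a userspace sketch only: struct inject_cmd and parse_inject() are
hypothetical names, and the 64-byte name limit is an arbitrary choice for
the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct inject_cmd {
	bool remove;	/* true if the name was prefixed with '!' */
	char func[64];	/* function name without prefix or newline */
};

/*
 * Parse one write to the "inject" file.  A leading '!' means "remove
 * this function", anything else means "add this function".
 * Returns 0 on success, -1 for an empty or overlong name.
 */
static int parse_inject(const char *buf, struct inject_cmd *cmd)
{
	size_t len;

	assert(buf != NULL && cmd != NULL);

	cmd->remove = (buf[0] == '!');
	if (cmd->remove)
		buf++;

	len = strcspn(buf, "\n");	/* echo appends a newline */
	if (len == 0 || len >= sizeof(cmd->func))
		return -1;

	memcpy(cmd->func, buf, len);
	cmd->func[len] = '\0';
	return 0;
}
```

The single-file approach keeps the interface small; a separate "uninject"
file would carry the same information in the file name instead of the
prefix.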

Re: [PATCH bpf-next v4 5/5] error-injection: Support fault injection framework

2018-01-11 Thread Akinobu Mita

2018-01-11 9:51 GMT+09:00 Masami Hiramatsu :
> Support in-kernel fault-injection framework via debugfs.
> This allows you to inject a conditional error to specified
> function using debugfs interfaces.
>
> Here is the result of test script described in
> Documentation/fault-injection/fault-injection.txt
>
>   ===
>   # ./test_fail_function.sh
>   1+0 records in
>   1+0 records out
>   1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
>   btrfs-progs v4.4
>   See http://btrfs.wiki.kernel.org for more information.
>
>   Label:  (null)
>   UUID:   bfa96010-12e9-4360-aed0-42eec7af5798
>   Node size:  16384
>   Sector size:4096
>   Filesystem size:1001.00MiB
>   Block group profiles:
> Data: single8.00MiB
> Metadata: DUP  58.00MiB
> System:   DUP  12.00MiB
>   SSD detected:   no
>   Incompat features:  extref, skinny-metadata
>   Number of devices:  1
>   Devices:
>  IDSIZE  PATH
>   1  1001.00MiB  /dev/loop2
>
>   mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
>   SUCCESS!
>   ===
>
>
> Signed-off-by: Masami Hiramatsu 
> Reviewed-by: Josef Bacik 
> ---
>   Changes in v3:
>- Check and adjust error value for each target function
>- Clear kporbe flag for reuse
>- Add more documents and example
> ---
>  Documentation/fault-injection/fault-injection.txt |   62 ++
>  kernel/Makefile   |1
>  kernel/fail_function.c|  217 
> +
>  lib/Kconfig.debug |   10 +
>  4 files changed, 290 insertions(+)
>  create mode 100644 kernel/fail_function.c
>
> diff --git a/Documentation/fault-injection/fault-injection.txt 
> b/Documentation/fault-injection/fault-injection.txt
> index 918972babcd8..4aecbceef9d2 100644
> --- a/Documentation/fault-injection/fault-injection.txt
> +++ b/Documentation/fault-injection/fault-injection.txt
> @@ -30,6 +30,12 @@ o fail_mmc_request
>injects MMC data errors on devices permitted by setting
>debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
>
> +o fail_function
> +
> +  injects error return on specific functions, which are marked by
> +  ALLOW_ERROR_INJECTION() macro, by setting debugfs entries
> +  under /sys/kernel/debug/fail_function. No boot option supported.
> +
>  Configure fault-injection capabilities behavior
>  ---
>
> @@ -123,6 +129,24 @@ configuration of fault-injection capabilities.
> default is 'N', setting it to 'Y' will disable failure injections
> when dealing with private (address space) futexes.
>
> +- /sys/kernel/debug/fail_function/inject:
> +
> +   specifies the target function of error injection by name.
> +
> +- /sys/kernel/debug/fail_function/retval:
> +
> +   specifies the "error" return value to inject to the given
> +   function.
> +

Is it possible to inject errors into multiple functions at the same time?

If so, it would be more useful to support that in the fault injection, too,
because some kinds of bugs are caused by a combination of errors
(e.g. another error in an error path).

I suggest the following interface.

- /sys/kernel/debug/fail_function/inject:

  specifies the target function of error injection by name.
  /sys/kernel/debug/fail_function// directory will be created.

- /sys/kernel/debug/fail_function/uninject:

  specifies the target function of error injection by name that is
  currently being injected.  /sys/kernel/debug/fail_function//
  directory will be removed.

- /sys/kernel/debug/fail_function//retval:

  specifies the "error" return value to inject to the given function.
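
To make the suggested interface a bit more concrete, here is a userspace
model of the bookkeeping behind it.  This is a sketch under my own
assumptions, not kernel code: the fixed-size table, struct target and the
functions inject(), uninject(), set_retval() and find_target() are all
hypothetical, and "some_other_func" in the usage note is a placeholder name.

```c
#include <assert.h>
#include <string.h>

#define MAX_TARGETS 8
#define NAME_LEN 64

/* One entry per injected function, i.e. one per-function directory. */
struct target {
	char name[NAME_LEN];
	long retval;
	int used;
};

static struct target targets[MAX_TARGETS];

static struct target *find_target(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < MAX_TARGETS; i++)
		if (targets[i].used && strcmp(targets[i].name, name) == 0)
			return &targets[i];
	return NULL;
}

/* Models a write to "inject": creates the per-function entry. */
static int inject(const char *name)
{
	size_t i, len = strlen(name);

	if (len == 0 || len >= NAME_LEN || find_target(name))
		return -1;
	for (i = 0; i < MAX_TARGETS; i++) {
		if (!targets[i].used) {
			strcpy(targets[i].name, name);
			targets[i].retval = 0;
			targets[i].used = 1;
			return 0;
		}
	}
	return -1;	/* table full */
}

/* Models a write to "uninject": removes the per-function entry. */
static int uninject(const char *name)
{
	struct target *t = find_target(name);

	if (!t)
		return -1;
	t->used = 0;
	return 0;
}

/* Models a write to the per-function "retval" file. */
static int set_retval(const char *name, long retval)
{
	struct target *t = find_target(name);

	if (!t)
		return -1;
	t->retval = retval;
	return 0;
}
```

Calling inject("open_ctree") and inject("some_other_func") and then
set_retval("open_ctree", -12) leaves two independent entries, which is the
point of the per-function retval: each target keeps its own error value.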

Re: [PATCH v2 1/2] fault-inject: Restore support for task-independent fault injection

2017-08-22 Thread Akinobu Mita

2017-08-23 8:00 GMT+09:00 Bart Van Assche <bart.vanass...@wdc.com>:
> Certain faults should be injected independent of the context
> in which these occur. Commit e41d58185f14 made it impossible to
> inject faults independent of their context. Restore support for
> task-independent fault injection by adding the attribute 'global'.

A fail-make-request user reported a problem that was introduced by the
follow-up patches for systematic fault injection.

Please check commit 9eeb52ae712e ("fault-inject: fix wrong
should_fail() decision in task context") and see whether the problem
you reported is the same one that commit addresses.

> References: commit e41d58185f14 ("fault-inject: support systematic fault 
> injection")
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Dmitry Vyukov <dvyu...@google.com>
> Cc: Akinobu Mita <akinobu.m...@gmail.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Andrew Morton <a...@linux-foundation.org>
> ---
>  include/linux/fault-inject.h | 11 +--
>  lib/fault-inject.c   |  4 +++-
>  2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
> index 728d4e0292aa..88dae2f21881 100644
> --- a/include/linux/fault-inject.h
> +++ b/include/linux/fault-inject.h
> @@ -18,6 +18,7 @@ struct fault_attr {
> atomic_t times;
> atomic_t space;
> unsigned long verbose;
> +   bool global;
> bool task_filter;
> unsigned long stacktrace_depth;
> unsigned long require_start;
> @@ -30,17 +31,23 @@ struct fault_attr {
> struct dentry *dname;
>  };
>
> -#define FAULT_ATTR_INITIALIZER {   \
> +#define __FAULT_ATTR_INITIALIZER(__global) {   \
> .interval = 1,  \
> .times = ATOMIC_INIT(1),\
> .require_end = ULONG_MAX,   \
> +   .global = (__global),   \
> .stacktrace_depth = 32, \
> .ratelimit_state = RATELIMIT_STATE_INIT_DISABLED,   \
> .verbose = 2,   \
> .dname = NULL,  \
> }
>
> -#define DECLARE_FAULT_ATTR(name) struct fault_attr name = 
> FAULT_ATTR_INITIALIZER
> +#define FAULT_ATTR_INITIALIZER __FAULT_ATTR_INITIALIZER(false)
> +
> +#define DECLARE_FAULT_ATTR(name)   \
> +   struct fault_attr name = __FAULT_ATTR_INITIALIZER(false)
> +#define DECLARE_GLOBAL_FAULT_ATTR(name)\
> +   struct fault_attr name = __FAULT_ATTR_INITIALIZER(true)
>  int setup_fault_attr(struct fault_attr *attr, char *str);
>  bool should_fail(struct fault_attr *attr, ssize_t size);
>
> diff --git a/lib/fault-inject.c b/lib/fault-inject.c
> index 7d315fdb9f13..c8f6ef5df3c6 100644
> --- a/lib/fault-inject.c
> +++ b/lib/fault-inject.c
> @@ -107,7 +107,7 @@ static inline bool fail_stacktrace(struct fault_attr 
> *attr)
>
>  bool should_fail(struct fault_attr *attr, ssize_t size)
>  {
> -   if (in_task()) {
> +   if (!attr->global && in_task()) {
> unsigned int fail_nth = READ_ONCE(current->fail_nth);
>
> if (fail_nth && !WRITE_ONCE(current->fail_nth, fail_nth - 1))
> @@ -224,6 +224,8 @@ struct dentry *fault_create_debugfs_attr(const char *name,
> if (!debugfs_create_u32("verbose_ratelimit_burst", mode, dir,
> &attr->ratelimit_state.burst))
> goto fail;
> +   if (!debugfs_create_bool("global", mode, dir, &attr->global))
> +   goto fail;
> if (!debugfs_create_bool("task-filter", mode, dir, 
> &attr->task_filter))
> goto fail;
>
> --
> 2.14.0
>

Re: [PATCH v2 1/2] fault-inject: Restore support for task-independent fault injection

2017-08-22 Thread Akinobu Mita

2017-08-23 8:00 GMT+09:00 Bart Van Assche :
> Certain faults should be injected independent of the context
> in which these occur. Commit e41d58185f14 made it impossible to
> inject faults independent of their context. Restore support for
> task-independent fault injection by adding the attribute 'global'.

A fail-make-request user reported a problem that was introduced by the
follow-up patches for systematic fault injection.

Please check commit 9eeb52ae712e ("fault-inject: fix wrong
should_fail() decision in task context") and see whether the problem
you reported is the same one that commit addresses.

> References: commit e41d58185f14 ("fault-inject: support systematic fault 
> injection")
> Signed-off-by: Bart Van Assche 
> Cc: Dmitry Vyukov 
> Cc: Akinobu Mita 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> ---
>  include/linux/fault-inject.h | 11 +--
>  lib/fault-inject.c   |  4 +++-
>  2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
> index 728d4e0292aa..88dae2f21881 100644
> --- a/include/linux/fault-inject.h
> +++ b/include/linux/fault-inject.h
> @@ -18,6 +18,7 @@ struct fault_attr {
> atomic_t times;
> atomic_t space;
> unsigned long verbose;
> +   bool global;
> bool task_filter;
> unsigned long stacktrace_depth;
> unsigned long require_start;
> @@ -30,17 +31,23 @@ struct fault_attr {
> struct dentry *dname;
>  };
>
> -#define FAULT_ATTR_INITIALIZER {   \
> +#define __FAULT_ATTR_INITIALIZER(__global) {   \
> .interval = 1,  \
> .times = ATOMIC_INIT(1),\
> .require_end = ULONG_MAX,   \
> +   .global = (__global),   \
> .stacktrace_depth = 32, \
> .ratelimit_state = RATELIMIT_STATE_INIT_DISABLED,   \
> .verbose = 2,   \
> .dname = NULL,  \
> }
>
> -#define DECLARE_FAULT_ATTR(name) struct fault_attr name = 
> FAULT_ATTR_INITIALIZER
> +#define FAULT_ATTR_INITIALIZER __FAULT_ATTR_INITIALIZER(false)
> +
> +#define DECLARE_FAULT_ATTR(name)   \
> +   struct fault_attr name = __FAULT_ATTR_INITIALIZER(false)
> +#define DECLARE_GLOBAL_FAULT_ATTR(name)\
> +   struct fault_attr name = __FAULT_ATTR_INITIALIZER(true)
>  int setup_fault_attr(struct fault_attr *attr, char *str);
>  bool should_fail(struct fault_attr *attr, ssize_t size);
>
> diff --git a/lib/fault-inject.c b/lib/fault-inject.c
> index 7d315fdb9f13..c8f6ef5df3c6 100644
> --- a/lib/fault-inject.c
> +++ b/lib/fault-inject.c
> @@ -107,7 +107,7 @@ static inline bool fail_stacktrace(struct fault_attr 
> *attr)
>
>  bool should_fail(struct fault_attr *attr, ssize_t size)
>  {
> -   if (in_task()) {
> +   if (!attr->global && in_task()) {
> unsigned int fail_nth = READ_ONCE(current->fail_nth);
>
> if (fail_nth && !WRITE_ONCE(current->fail_nth, fail_nth - 1))
> @@ -224,6 +224,8 @@ struct dentry *fault_create_debugfs_attr(const char *name,
> if (!debugfs_create_u32("verbose_ratelimit_burst", mode, dir,
> >ratelimit_state.burst))
> goto fail;
> +   if (!debugfs_create_bool("global", mode, dir, >global))
> +   goto fail;
> if (!debugfs_create_bool("task-filter", mode, dir, 
> >task_filter))
> goto fail;
>
> --
> 2.14.0
>

[PATCH] fault-inject: fix wrong should_fail() decision in task context

2017-08-01 Thread Akinobu Mita

Commit 1203c8e6fb0a ("fault-inject: simplify access check for fail-nth")
unintentionally broke a conditional statement in should_fail().  Since that
change, no faults are injected in task context unless systematic fault
injection is in use.

This change restores the previous, correct behaviour.

Fixes: 1203c8e6fb0a ("fault-inject: simplify access check for fail-nth")
Cc: Dmitry Vyukov <dvyu...@google.com>
Cc: Lu Fengqi <lufq.f...@cn.fujitsu.com>
Reported-by: Lu Fengqi <lufq.f...@cn.fujitsu.com>
Signed-off-by: Akinobu Mita <akinobu.m...@gmail.com>
---
 lib/fault-inject.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/fault-inject.c b/lib/fault-inject.c
index 7d315fd..cf7b129 100644
--- a/lib/fault-inject.c
+++ b/lib/fault-inject.c
@@ -110,10 +110,12 @@ bool should_fail(struct fault_attr *attr, ssize_t size)
 	if (in_task()) {
 		unsigned int fail_nth = READ_ONCE(current->fail_nth);
 
-		if (fail_nth && !WRITE_ONCE(current->fail_nth, fail_nth - 1))
-			goto fail;
+		if (fail_nth) {
+			if (!WRITE_ONCE(current->fail_nth, fail_nth - 1))
+				goto fail;
 
-		return false;
+			return false;
+		}
 	}
 
 	/* No need to check any other properties if the probability is 0 */
-- 
2.7.4
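
The difference between the broken and the restored conditional can be
reproduced outside the kernel.  The sketch below is a user-space model, not
kernel code: enum verdict, check_broken() and check_fixed() are hypothetical
names, a plain pointer stands in for current->fail_nth, and a plain store
stands in for WRITE_ONCE().

```c
#include <assert.h>

/* What the fail-nth block at the top of should_fail() decides. */
enum verdict {
	VERDICT_FAIL,		/* "goto fail" */
	VERDICT_NO_FAIL,	/* "return false" */
	VERDICT_FALL_THROUGH,	/* continue with probability, times, ... */
};

/* Model of the logic after commit 1203c8e6fb0a. */
static enum verdict check_broken(unsigned int *fail_nth_p)
{
	unsigned int fail_nth;

	assert(fail_nth_p);
	fail_nth = *fail_nth_p;
	if (fail_nth) {
		*fail_nth_p = fail_nth - 1;
		if (fail_nth - 1 == 0)
			return VERDICT_FAIL;
	}
	/* Reached even when fail_nth was 0, so nothing else is checked. */
	return VERDICT_NO_FAIL;
}

/* Model of the logic restored by this patch. */
static enum verdict check_fixed(unsigned int *fail_nth_p)
{
	unsigned int fail_nth;

	assert(fail_nth_p);
	fail_nth = *fail_nth_p;
	if (fail_nth) {
		*fail_nth_p = fail_nth - 1;
		if (fail_nth - 1 == 0)
			return VERDICT_FAIL;
		return VERDICT_NO_FAIL;
	}
	/* fail-nth is not armed: leave the decision to the other checks. */
	return VERDICT_FALL_THROUGH;
}
```

With the counter at 0, check_broken() reports "no failure" immediately,
while check_fixed() falls through to the remaining checks, which is what a
probability-based setup such as fail_make_request relies on.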

Re: [PATCH -mm] fault-inject: avoid unwanted data race to task->fail_nth

2017-08-01 Thread Akinobu Mita

2017-08-02 0:54 GMT+09:00 Akinobu Mita <akinobu.m...@gmail.com>:
> 2017-08-01 22:45 GMT+09:00 Dmitry Vyukov <dvyu...@google.com>:
>> On Tue, Aug 1, 2017 at 3:09 PM, Lu Fengqi <lufq.f...@cn.fujitsu.com> wrote:
>>> On Fri, Jul 14, 2017 at 01:14:52AM +0900, Akinobu Mita wrote:
>>>>The fault-inject-make-fail-nth-read-write-interface-symmetric.patch in
>>>>-mm tree allows users to set task->fail_nth for non current task by procfs.
>>>>On the other hand, the current task's fail_nth is decreased to zero in
>>>>fault-injection path without any specific locks.
>>>>
>>>>So we need to prevent the task->fail_nth from being unexpected value by
>>>>data races (for example, setting task->fail_nth to zero while decreasing
>>>>the current->fail_nth).  In this fix, we use READ_ONCE() and WRITE_ONCE()
>>>>to prevent the compiler from creating unsolicited accesses.
>>>>
>>>>Cc: Dmitry Vyukov <dvyu...@google.com>
>>>>Reported-by: Dmitry Vyukov <dvyu...@google.com>
>>>>Signed-off-by: Akinobu Mita <akinobu.m...@gmail.com>
>>>>---
>>>> fs/proc/base.c | 5 +++--
>>>> lib/fault-inject.c | 7 +--
>>>> 2 files changed, 8 insertions(+), 4 deletions(-)
>>>>
>>>>diff --git a/fs/proc/base.c b/fs/proc/base.c
>>>>index ecc8a25..719c2e9 100644
>>>>--- a/fs/proc/base.c
>>>>+++ b/fs/proc/base.c
>>>>@@ -1370,7 +1370,7 @@ static ssize_t proc_fail_nth_write(struct file *file, 
>>>>const char __user *buf,
>>>>   task = get_proc_task(file_inode(file));
>>>>   if (!task)
>>>>   return -ESRCH;
>>>>-  task->fail_nth = n;
>>>>+  WRITE_ONCE(task->fail_nth, n);
>>>>   put_task_struct(task);
>>>>
>>>>   return count;
>>>>@@ -1386,7 +1386,8 @@ static ssize_t proc_fail_nth_read(struct file *file, 
>>>>char __user *buf,
>>>>   task = get_proc_task(file_inode(file));
>>>>   if (!task)
>>>>   return -ESRCH;
>>>>-  len = snprintf(numbuf, sizeof(numbuf), "%u\n", task->fail_nth);
>>>>+  len = snprintf(numbuf, sizeof(numbuf), "%u\n",
>>>>+  READ_ONCE(task->fail_nth));
>>>>   len = simple_read_from_buffer(buf, count, ppos, numbuf, len);
>>>>   put_task_struct(task);
>>>>
>>>>diff --git a/lib/fault-inject.c b/lib/fault-inject.c
>>>>index 09ac73c1..7d315fd 100644
>>>>--- a/lib/fault-inject.c
>>>>+++ b/lib/fault-inject.c
>>>>@@ -107,9 +107,12 @@ static inline bool fail_stacktrace(struct fault_attr 
>>>>*attr)
>>>>
>>>> bool should_fail(struct fault_attr *attr, ssize_t size)
>>>> {
>>>>-  if (in_task() && current->fail_nth) {
>>>>-  if (--current->fail_nth == 0)
>>>>+  if (in_task()) {
>>>>+  unsigned int fail_nth = READ_ONCE(current->fail_nth);
>>>>+
>>>>+  if (fail_nth && !WRITE_ONCE(current->fail_nth, fail_nth - 1))
>>>>   goto fail;
>>>>+
>>>>   return false;
>>>>   }
>>>>
>>>>--
>>>>2.7.4
>>>>
>>>>
>>>>
>>> hi
>>>
>>> I'm a btrfs developer. I found that fail_make_request didn't produce the
>>> expected IO ERROR when running xfstests on linux 4.13-rc1.
>>>
>>> That testcase enable fail_make_request by the following commands:
>>> # echo 100 > /sys/kernel/debug/fail_make_request/probability
>>> # echo 2 > /sys/kernel/debug/fail_make_request/times
>>> # echo 0 > /sys/kernel/debug/fail_make_request/verbose
>>> # echo 1 > /sys/block/sda/sda1/make-it-fail
>>> # dd if=/dev/zero of=/dev/sda1 bs=128K count=1 oflag=direct
>>>
>>> As I understand it, after applying this patch, I have to write
>>> /proc//file-nth firstly so that dd process can catch the IO ERROR.
>>> However, the dd process is so fast that I can't write file-nth.
>>>
>>> So, could you tell me how to produce IO ERROR under these circumstances?
>>
>> Hi,
>>
>> fail-nth is orthogonal to the existing mechanisms, so if you have a
>> setup that fails all sites with certain probability, that should
>> continue to work.
>
> Lu's setting for fail_make_request is fine before introducing systematic
> fault injection and they want to inject fail_make_request only.
>
> So I think we need a global parameter to turn on/off the systematic fault
> injection.  (e.g. /sys/kernel/debug/systematic-fault-inject/enable)

Oops.  That is simply a bug in my patch.  The correct should_fail() is below.

bool should_fail(struct fault_attr *attr, ssize_t size)
{
	if (in_task()) {
		unsigned int fail_nth = READ_ONCE(current->fail_nth);

		if (fail_nth) {
			if (!WRITE_ONCE(current->fail_nth, fail_nth - 1))
				goto fail;

			return false;
		}
	}
	...


>> If you are writing a new facility and want to use fail-nth, then the
>> test process itself needs to cooperate and write fail-nth accordingly.
>> See the original patch for an example of how to do it:
>> https://groups.google.com/d/msg/syzkaller/DbB4rjYd82s/3MHDwtcqCAAJ

Re: [PATCH -mm] fault-inject: avoid unwanted data race to task->fail_nth

2017-08-01 Thread Akinobu Mita

2017-08-01 22:45 GMT+09:00 Dmitry Vyukov <dvyu...@google.com>:
> On Tue, Aug 1, 2017 at 3:09 PM, Lu Fengqi <lufq.f...@cn.fujitsu.com> wrote:
>> On Fri, Jul 14, 2017 at 01:14:52AM +0900, Akinobu Mita wrote:
>>>The fault-inject-make-fail-nth-read-write-interface-symmetric.patch in
>>>-mm tree allows users to set task->fail_nth for non current task by procfs.
>>>On the other hand, the current task's fail_nth is decreased to zero in
>>>fault-injection path without any specific locks.
>>>
>>>So we need to prevent the task->fail_nth from being unexpected value by
>>>data races (for example, setting task->fail_nth to zero while decreasing
>>>the current->fail_nth).  In this fix, we use READ_ONCE() and WRITE_ONCE()
>>>to prevent the compiler from creating unsolicited accesses.
>>>
>>>Cc: Dmitry Vyukov <dvyu...@google.com>
>>>Reported-by: Dmitry Vyukov <dvyu...@google.com>
>>>Signed-off-by: Akinobu Mita <akinobu.m...@gmail.com>
>>>---
>>> fs/proc/base.c | 5 +++--
>>> lib/fault-inject.c | 7 +--
>>> 2 files changed, 8 insertions(+), 4 deletions(-)
>>>
>>>diff --git a/fs/proc/base.c b/fs/proc/base.c
>>>index ecc8a25..719c2e9 100644
>>>--- a/fs/proc/base.c
>>>+++ b/fs/proc/base.c
>>>@@ -1370,7 +1370,7 @@ static ssize_t proc_fail_nth_write(struct file *file, 
>>>const char __user *buf,
>>>   task = get_proc_task(file_inode(file));
>>>   if (!task)
>>>   return -ESRCH;
>>>-  task->fail_nth = n;
>>>+  WRITE_ONCE(task->fail_nth, n);
>>>   put_task_struct(task);
>>>
>>>   return count;
>>>@@ -1386,7 +1386,8 @@ static ssize_t proc_fail_nth_read(struct file *file, 
>>>char __user *buf,
>>>   task = get_proc_task(file_inode(file));
>>>   if (!task)
>>>   return -ESRCH;
>>>-  len = snprintf(numbuf, sizeof(numbuf), "%u\n", task->fail_nth);
>>>+  len = snprintf(numbuf, sizeof(numbuf), "%u\n",
>>>+  READ_ONCE(task->fail_nth));
>>>   len = simple_read_from_buffer(buf, count, ppos, numbuf, len);
>>>   put_task_struct(task);
>>>
>>>diff --git a/lib/fault-inject.c b/lib/fault-inject.c
>>>index 09ac73c1..7d315fd 100644
>>>--- a/lib/fault-inject.c
>>>+++ b/lib/fault-inject.c
>>>@@ -107,9 +107,12 @@ static inline bool fail_stacktrace(struct fault_attr 
>>>*attr)
>>>
>>> bool should_fail(struct fault_attr *attr, ssize_t size)
>>> {
>>>-  if (in_task() && current->fail_nth) {
>>>-  if (--current->fail_nth == 0)
>>>+  if (in_task()) {
>>>+  unsigned int fail_nth = READ_ONCE(current->fail_nth);
>>>+
>>>+  if (fail_nth && !WRITE_ONCE(current->fail_nth, fail_nth - 1))
>>>   goto fail;
>>>+
>>>   return false;
>>>   }
>>>
>>>--
>>>2.7.4
>>>
>>>
>>>
>> hi
>>
>> I'm a btrfs developer. I found that fail_make_request didn't produce the
>> expected IO ERROR when running xfstests on linux 4.13-rc1.
>>
>> That testcase enable fail_make_request by the following commands:
>> # echo 100 > /sys/kernel/debug/fail_make_request/probability
>> # echo 2 > /sys/kernel/debug/fail_make_request/times
>> # echo 0 > /sys/kernel/debug/fail_make_request/verbose
>> # echo 1 > /sys/block/sda/sda1/make-it-fail
>> # dd if=/dev/zero of=/dev/sda1 bs=128K count=1 oflag=direct
>>
>> As I understand it, after applying this patch, I have to write
>> /proc//file-nth firstly so that dd process can catch the IO ERROR.
>> However, the dd process is so fast that I can't write file-nth.
>>
>> So, could you tell me how to produce IO ERROR under these circumstances?
>
> Hi,
>
> fail-nth is orthogonal to the existing mechanisms, so if you have a
> setup that fails all sites with certain probability, that should
> continue to work.

Lu's fail_make_request settings worked fine before systematic fault
injection was introduced, and they want to inject fail_make_request
faults only.

So I think we need a global parameter to turn systematic fault injection
on and off (e.g. /sys/kernel/debug/systematic-fault-inject/enable).

> If you are writing a new facility and want to use fail-nth, then the
> test process itself needs to cooperate and write fail-nth accordingly.
> See the original patch for an example of how to do it:
> https://groups.google.com/d/msg/syzkaller/DbB4rjYd82s/3MHDwtcqCAAJ

Re: [PATCH -mm 3/5] fault-inject: make fail-nth read/write interface symmetric

2017-07-13 Thread Akinobu Mita

2017-07-13 5:49 GMT+09:00 Andrew Morton <a...@linux-foundation.org>:
> On Fri, 7 Apr 2017 22:37:01 +0200 Dmitry Vyukov <dvyu...@google.com> wrote:
>
>> On Thu, Apr 6, 2017 at 4:55 PM, Akinobu Mita <akinobu.m...@gmail.com> wrote:
>> > The read interface for fail-nth looks a bit odd.  Read from this file
>> > returns "N..." or "Y..." (this makes me surprise when cat this
>> > file).  Because there is no EOF condition. The first character indicates
>> > current->fail_nth is zero or not, and then current->fail_nth is reset
>> > to zero.
>> >
>> > Just returning task->fail_nth value is more natural to understand.
>> >
>> > Cc: Dmitry Vyukov <dvyu...@google.com>
>> > Signed-off-by: Akinobu Mita <akinobu.m...@gmail.com>
>> > ---
>> >  Documentation/fault-injection/fault-injection.txt | 13 +++--
>> >  fs/proc/base.c| 14 ++
>> >  2 files changed, 13 insertions(+), 14 deletions(-)
>> >
>> > diff --git a/Documentation/fault-injection/fault-injection.txt 
>> > b/Documentation/fault-injection/fault-injection.txt
>> > index a321905..370ddcb 100644
>> > --- a/Documentation/fault-injection/fault-injection.txt
>> > +++ b/Documentation/fault-injection/fault-injection.txt
>> > @@ -139,9 +139,9 @@ o proc entries
>> >  - /proc/self/task//fail-nth:
>> >
>> > Write to this file of integer N makes N-th call in the task fail.
>> > -   Read from this file returns a single char 'Y' or 'N'
>> > -   that says if the fault setup with a previous write to this file was
>> > -   injected or not, and disables the fault if it wasn't yet injected.
>> > +   Read from this file returns a integer value. A value of '0' 
>> > indicates
>> > +   that the fault setup with a previous write to this file was 
>> > injected.
>> > +   A positive integer N indicates that the fault wasn't yet injected.
>> > Note that this file enables all types of faults (slab, futex, etc).
>> > This setting takes precedence over all other generic debugfs 
>> > settings
>> > like probability, interval, times, etc. But per-capability settings
>> > @@ -325,13 +325,14 @@ int main()
>> > write(fail_nth, buf, strlen(buf));
>> > res = socketpair(AF_LOCAL, SOCK_STREAM, 0, fds);
>> > err = errno;
>> > -   read(fail_nth, buf, 1);
>> > +   pread(fail_nth, buf, sizeof(buf), 0);
>> > if (res == 0) {
>> > close(fds[0]);
>> > close(fds[1]);
>> > }
>> > -   printf("%d-th fault %c: res=%d/%d\n", i, buf[0], res, err);
>> > -   if (buf[0] != 'Y')
>> > +   printf("%d-th fault %c: res=%d/%d\n", i, atoi(buf) ? 'N' : 
>> > 'Y',
>> > +   res, err);
>> > +   if (atoi(buf))
>> > break;
>> > }
>> > return 0;
>> > diff --git a/fs/proc/base.c b/fs/proc/base.c
>> > index 42c52e2..9d14215 100644
>> > --- a/fs/proc/base.c
>> > +++ b/fs/proc/base.c
>> > @@ -1383,7 +1383,8 @@ static ssize_t proc_fail_nth_read(struct file *file, 
>> > char __user *buf,
>> >   size_t count, loff_t *ppos)
>> >  {
>> > struct task_struct *task;
>> > -   int err;
>> > +   char numbuf[PROC_NUMBUF];
>> > +   ssize_t len;
>> >
>> > task = get_proc_task(file_inode(file));
>> > if (!task)
>> > @@ -1391,13 +1392,10 @@ static ssize_t proc_fail_nth_read(struct file 
>> > *file, char __user *buf,
>> > put_task_struct(task);
>> > if (task != current)
>> > return -EPERM;
>> > -   if (count < 1)
>> > -   return -EINVAL;
>> > -   err = put_user((char)(current->fail_nth ? 'N' : 'Y'), buf);
>> > -   if (err)
>> > -   return err;
>> > -   current->fail_nth = 0;
>> > -   return 1;
>> > +   len = snprintf(numbuf, sizeof(numbuf), "%u\n", task->fail_nth);
>>
>> If we allow setting this for non current task, then we need to prevent
>> data races as the task uses task->fail_nth concurrently. Reads then
>> should use READ_ONCE and writes in fault-inject.c should use
>> WRITE_ONCE.
>
> This remains unresolved?

I have just sent a proposed fix. (Subject: [PATCH -mm] fault-inject: avoid
unwanted data race to task->fail_nth)
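
For the integer read interface discussed above, a test program has to turn
the text it reads back into a yes/no answer.  A minimal sketch, assuming
the file returns a decimal number followed by a newline as the quoted patch
describes; fault_was_injected() is a hypothetical helper name, not part of
any kernel or libc API.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Interpret the text read back from fail-nth.
 * "0\n" means the fault armed by the previous write was injected;
 * a positive value means it has not been injected yet.
 */
static int fault_was_injected(const char *text)
{
	assert(text);
	return strtoul(text, NULL, 10) == 0;
}
```

The caller is expected to NUL-terminate the buffer filled by pread()
before passing it to the helper.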

[PATCH -mm] fault-inject: avoid unwanted data race to task->fail_nth

2017-07-13 Thread Akinobu Mita

The fault-inject-make-fail-nth-read-write-interface-symmetric.patch in the
-mm tree allows users to set task->fail_nth for a non-current task through
procfs.  Meanwhile, the current task's fail_nth is decremented towards zero
in the fault-injection path without holding any lock.

So we need to keep data races from leaving task->fail_nth with an
unexpected value (for example, when task->fail_nth is set to zero while
current->fail_nth is being decremented).  This fix uses READ_ONCE() and
WRITE_ONCE() to prevent the compiler from creating unsolicited accesses.

Cc: Dmitry Vyukov <dvyu...@google.com>
Reported-by: Dmitry Vyukov <dvyu...@google.com>
Signed-off-by: Akinobu Mita <akinobu.m...@gmail.com>
---
 fs/proc/base.c | 5 +++--
 lib/fault-inject.c | 7 +--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ecc8a25..719c2e9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1370,7 +1370,7 @@ static ssize_t proc_fail_nth_write(struct file *file, const char __user *buf,
 	task = get_proc_task(file_inode(file));
 	if (!task)
 		return -ESRCH;
-	task->fail_nth = n;
+	WRITE_ONCE(task->fail_nth, n);
 	put_task_struct(task);
 
 	return count;
@@ -1386,7 +1386,8 @@ static ssize_t proc_fail_nth_read(struct file *file, char __user *buf,
 	task = get_proc_task(file_inode(file));
 	if (!task)
 		return -ESRCH;
-	len = snprintf(numbuf, sizeof(numbuf), "%u\n", task->fail_nth);
+	len = snprintf(numbuf, sizeof(numbuf), "%u\n",
+			READ_ONCE(task->fail_nth));
 	len = simple_read_from_buffer(buf, count, ppos, numbuf, len);
 	put_task_struct(task);
 
diff --git a/lib/fault-inject.c b/lib/fault-inject.c
index 09ac73c1..7d315fd 100644
--- a/lib/fault-inject.c
+++ b/lib/fault-inject.c
@@ -107,9 +107,12 @@ static inline bool fail_stacktrace(struct fault_attr *attr)
 
 bool should_fail(struct fault_attr *attr, ssize_t size)
 {
-	if (in_task() && current->fail_nth) {
-		if (--current->fail_nth == 0)
+	if (in_task()) {
+		unsigned int fail_nth = READ_ONCE(current->fail_nth);
+
+		if (fail_nth && !WRITE_ONCE(current->fail_nth, fail_nth - 1))
 			goto fail;
+
 		return false;
 	}
 
-- 
2.7.4
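
The read-once, write-once pattern used in this patch can be sketched in
user space.  The macros below are simplified stand-ins, not the kernel's
definitions: MY_READ_ONCE(), MY_WRITE_ONCE(), struct task_model and
consume_fail_nth() are hypothetical names, and __typeof__ assumes GCC or
Clang.  Volatile accesses only stop the compiler from repeating or
splitting the load and the store; they provide neither atomicity nor
ordering.

```c
#include <assert.h>

/* Simplified stand-ins, assuming a compiler that supports __typeof__. */
#define MY_READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define MY_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical stand-in for the task_struct field. */
struct task_model {
	unsigned int fail_nth;
};

/*
 * Load the counter exactly once, decide on the local copy, and store the
 * decremented value exactly once.  Returns 1 for the call that should fail.
 */
static int consume_fail_nth(struct task_model *t)
{
	unsigned int fail_nth;

	assert(t);
	fail_nth = MY_READ_ONCE(t->fail_nth);
	if (!fail_nth)
		return 0;	/* not armed */

	MY_WRITE_ONCE(t->fail_nth, fail_nth - 1);
	return fail_nth == 1;	/* the counter just reached zero */
}
```

Because the decision is made on the local copy, a concurrent writer that
resets the counter cannot make the function test one value and decrement
another.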
