Re: [PATCH v4 0/3] Fix pinctrl-single pcs_pin_dbg_show()

2021-03-24 Thread Hawa, Hanna




On 3/22/2021 7:56 AM, Drew Fustini wrote:

I'm curious what SoC are you using?


I'm working on Amazon Annapurna Labs SoCs (based on ARM cortex 
processors). That include multiple pins controlled with same register.




It's good to know who has hardware to test bits_per_mux in the future.

I pay attention to pinctrl-single as that is the driver used for the TI
AM3358 SoC used in a variety of BeagleBone boards.  It does not use
bits_per_mux, but I can verify that this does not cause any regression
for the AM3358 SoC:

   /sys/kernel/debug/pinctrl/44e10800.pinmux-pinctrl-single# cat pins
   registered pins: 142
   pin 0 (PIN0) 0:? 44e10800 0027 pinctrl-single
   pin 1 (PIN1) 0:? 44e10804 0027 pinctrl-single
   pin 2 (PIN2) 0:? 44e10808 0027 pinctrl-single
   pin 3 (PIN3) 0:? 44e1080c 0027 pinctrl-single
   pin 4 (PIN4) 0:? 44e10810 0027 pinctrl-single
   pin 5 (PIN5) 0:? 44e10814 0027 pinctrl-single
   pin 6 (PIN6) 0:? 44e10818 0027 pinctrl-single
   pin 7 (PIN7) 0:? 44e1081c 0027 pinctrl-single
   pin 8 (PIN8) 22:gpio-96-127 44e10820 0027 pinctrl-single
   pin 9 (PIN9) 23:gpio-96-127 44e10824 0037 pinctrl-single
   pin 10 (PIN10) 26:gpio-96-127 44e10828 0037 pinctrl-single
   pin 11 (PIN11) 27:gpio-96-127 44e1082c 0037 pinctrl-single
   pin 12 (PIN12) 0:? 44e10830 0037 pinctrl-single
   
   pin 140 (PIN140) 0:? 44e10a30 0028 pinctrl-single
   pin 141 (PIN141) 13:gpio-64-95 44e10a34 0020 pinctrl-single

Reviewed-by: Drew Fustini


Thanks for review and verify the change.

Thanks,
Hanna



Thanks,
Drew


Re: [PATCH v3 3/3] pinctrl: pinctrl-single: fix pcs_pin_dbg_show() when bits_per_mux is not zero

2021-03-19 Thread Hawa, Hanna




On 3/18/2021 2:15 PM, Andy Shevchenko wrote:



On Wed, Mar 17, 2021 at 11:42 PM Hanna Hawa  wrote:

An SError was detected when trying to print the supported pins in a

What is SError? Yes, I have read a discussion, but here is the hint:
if a person sees this as a first text due to, for example, bisecting
an issue, what she/he can get from this cryptic name?


What you suggest?
s/An SError/A kernel-panic/?
Or remove the sentence and keep the below:
"
This change fixes the pcs_pin_dbg_show() in pinctrl-single driver when 
bits_per_mux is not zero. In addition move offset calculation and pin 
offset in register to common function.

"




pinctrl device which supports multiple pins per register. This change
fixes the pcs_pin_dbg_show() in pinctrl-single driver when bits_per_mux
is not zero. In addition move offset calculation and pin offset in
register to common function.

Reviewed-by: Andy Shevchenko


Thanks




Fixes: 4e7e8017a80e ("pinctrl: pinctrl-single: enhance to configure multiple pins of 
different modules")
Signed-off-by: Hanna Hawa
---
  drivers/pinctrl/pinctrl-single.c | 57 +---
  1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/drivers/pinctrl/pinctrl-single.c b/drivers/pinctrl/pinctrl-single.c
index f3394517cb2e..4595acf6545e 100644
--- a/drivers/pinctrl/pinctrl-single.c
+++ b/drivers/pinctrl/pinctrl-single.c
@@ -270,20 +270,46 @@ static void __maybe_unused pcs_writel(unsigned val, void 
__iomem *reg)
 writel(val, reg);
  }

+static unsigned int pcs_pin_reg_offset_get(struct pcs_device *pcs,
+  unsigned int pin)
+{
+   unsigned int mux_bytes;
+
+   mux_bytes = pcs->width / BITS_PER_BYTE;

Can be folded to one line.


Ack




+   if (pcs->bits_per_mux) {
+   unsigned int pin_offset_bytes;
+
+   pin_offset_bytes = (pcs->bits_per_pin * pin) / BITS_PER_BYTE;
+   return (pin_offset_bytes / mux_bytes) * mux_bytes;

Side note for the further improvements (in a separate change, because
I see that you just copied an original code, and after all this is
just a fix patch): this can be replaced by round down APIs (one which
works for arbitrary divisors).


Agree, didn't want to change the formula as it's fix patch. (here and 
below), this can be taken for further improvements.





+   }
+
+   return pin * mux_bytes;
+}
+
+static unsigned int pcs_pin_shift_reg_get(struct pcs_device *pcs,
+ unsigned int pin)
+{
+   return (pin % (pcs->width / pcs->bits_per_pin)) * pcs->bits_per_pin;

Also a side note: I'm wondering if this can be optimized to have less divisions.


+}
+
  static void pcs_pin_dbg_show(struct pinctrl_dev *pctldev,
 struct seq_file *s,
 unsigned pin)
  {


Thanks,
Hanna


Re: [PATCH v2 3/3] pinctrl: pinctrl-single: fix pcs_pin_dbg_show() when bits_per_mux != 0

2021-03-17 Thread Hawa, Hanna




On 3/17/2021 2:27 PM, Andy Shevchenko wrote:


On Tue, Mar 16, 2021 at 11:24 PM Hanna Hawa  wrote:


An SError was detected when trying to print the supported pins in a


What SError is?


System error:
[   24.257831] SError Interrupt on CPU0, code 0xbf02 -- SError
...
[   24.257855] Kernel panic - not syncing: Asynchronous SError Interrupt




pinctrl device which supports multiple pins per register. This change
fixes the pcs_pin_dbg_show() in pinctrl-single driver when
bits_per_mux != 0. In addition move offset calculation and pin offset in


'!= 0' -> 'is not zero'


Will be fixed.




register to common function.


Fixes tag?


Sure, will be added.




Signed-off-by: Hanna Hawa 
---
  drivers/pinctrl/pinctrl-single.c | 66 ++--
  1 file changed, 45 insertions(+), 21 deletions(-)

diff --git a/drivers/pinctrl/pinctrl-single.c b/drivers/pinctrl/pinctrl-single.c
index f3394517cb2e..434f90c8b1b3 100644
--- a/drivers/pinctrl/pinctrl-single.c
+++ b/drivers/pinctrl/pinctrl-single.c
@@ -270,20 +270,53 @@ static void __maybe_unused pcs_writel(unsigned val, void 
__iomem *reg)
 writel(val, reg);
  }

+static unsigned int pcs_pin_reg_offset_get(struct pcs_device *pcs,
+  unsigned int pin)
+{
+   unsigned int offset, mux_bytes;


Offset can be replaced by direct return statements.


Ack.




+   mux_bytes = pcs->width / BITS_PER_BYTE;
+
+   if (pcs->bits_per_mux) {
+   unsigned int pin_offset_bytes;
+
+   pin_offset_bytes = (pcs->bits_per_pin * pin) / BITS_PER_BYTE;
+   offset = (pin_offset_bytes / mux_bytes) * mux_bytes;
+   } else {
+   offset = pin * mux_bytes;
+   }
+
+   return offset;
+}
+
+static unsigned int pcs_pin_shift_reg_get(struct pcs_device *pcs,
+ unsigned int pin)
+{
+   return ((pin % (pcs->width / pcs->bits_per_pin)) * pcs->bits_per_pin);


Too many parentheses.


Will remove the extra parentheses.




+}
+
  static void pcs_pin_dbg_show(struct pinctrl_dev *pctldev,
 struct seq_file *s,
 unsigned pin)
  {
 struct pcs_device *pcs;
-   unsigned val, mux_bytes;
+   unsigned int val;
 unsigned long offset;
 size_t pa;

 pcs = pinctrl_dev_get_drvdata(pctldev);

-   mux_bytes = pcs->width / BITS_PER_BYTE;
-   offset = pin * mux_bytes;
-   val = pcs->read(pcs->base + offset);
+   offset = pcs_pin_reg_offset_get(pcs, pin);
+
+   if (pcs->bits_per_mux) {
+   unsigned int pin_shift_in_reg = pcs_pin_shift_reg_get(pcs, pin);



+   val = pcs->read(pcs->base + offset)
+   & (pcs->fmask << pin_shift_in_reg);


One line?
At least move & to the upper line.


+   } else {
+   val = pcs->read(pcs->base + offset);


It's the same as in above branch, why not

val = read();
if ()
  val &= fmask << _reg_get(...);

?


Agree, looks better

Thanks Andy, will send new patchset soon.

Thanks,
Hanna


Re: [PATCH v5 0/6] Amazon's Annapurna Labs Alpine v3 device-tree

2020-07-23 Thread Hawa, Hanna




On 7/23/2020 1:10 PM, Arnd Bergmann wrote:

Hi Hanna and Antoine,

I just came across this old series and noticed we had never merged it.

I don't know if the patches all still apply. Could you check and perhaps
resend to...@kernel.org  if they are still good to go into the coming
merge window?


Hi Arnd,

Sure, will rebase the patches and send ASAP.

Thanks,
Hanna


Re: [PATCH v9 3/3] edac: Add support for Amazon's Annapurna Labs L2 EDAC

2020-05-05 Thread Hawa, Hanna




On 3/10/2020 3:47 PM, Robert Richter wrote:

CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.



On 29.01.20 21:50:16, Hanna Hawa wrote:

Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and
report L2 errors.

Signed-off-by: Hanna Hawa 
---
  MAINTAINERS   |   5 +
  drivers/edac/Kconfig  |   8 ++
  drivers/edac/Makefile |   1 +
  drivers/edac/al_l2_edac.c | 270 ++
  4 files changed, 284 insertions(+)
  create mode 100644 drivers/edac/al_l2_edac.c


Hanna, most of the review comments by Boris for patch #1 (al_l1_edac)
apply here too. Please address them.


Hi Boris, Robert,

Sorry for not getting back to you sooner, will address comments on both 
files and upload new patchset ASAP.


Thanks,
Hanna



Thanks,

-Robert



Re: [PATCH v6 3/3] edac: Add support for Amazon's Annapurna Labs L2 EDAC

2019-10-10 Thread Hawa, Hanna




On 10/10/2019 2:09 AM, Rob Herring wrote:

+Sudeep

On Mon, Oct 7, 2019 at 10:18 AM Hanna Hawa  wrote:


Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and
report L2 errors.


I was curious why you needed a DT cache parsing function...

[...]


+static int al_l2_edac_probe(struct platform_device *pdev)
+{
+   struct edac_device_ctl_info *edac_dev;
+   struct al_l2_edac *al_l2;
+   struct device *dev = >dev;
+   int ret, i;
+
+   edac_dev = edac_device_alloc_ctl_info(sizeof(*al_l2), DRV_NAME, 1, "L",
+ 1, 2, NULL, 0,
+ edac_device_alloc_index());
+   if (!edac_dev)
+   return -ENOMEM;
+
+   al_l2 = edac_dev->pvt_info;
+   edac_dev->edac_check = al_l2_edac_check;
+   edac_dev->dev = dev;
+   edac_dev->mod_name = DRV_NAME;
+   edac_dev->dev_name = dev_name(dev);
+   edac_dev->ctl_name = "L2_cache";
+   platform_set_drvdata(pdev, edac_dev);
+
+   INIT_LIST_HEAD(_l2->l2_caches);
+
+   for_each_possible_cpu(i) {
+   struct device_node *cpu;
+   struct device_node *cpu_cache;
+   struct al_l2_cache *l2_cache;
+   bool found = false;
+
+   cpu = of_get_cpu_node(i, NULL);
+   if (!cpu)
+   continue;
+
+   cpu_cache = of_find_next_cache_node(cpu);
+   list_for_each_entry(l2_cache, _l2->l2_caches, list_node) {
+   if (l2_cache->of_node == cpu_cache) {
+   found = true;
+   break;
+   }
+   }
+
+   if (found) {
+   cpumask_set_cpu(i, _cache->cluster_cpus);
+   } else {
+   l2_cache = devm_kzalloc(dev, sizeof(*l2_cache),
+   GFP_KERNEL);
+   l2_cache->of_node = cpu_cache;
+   list_add(_cache->list_node, _l2->l2_caches);
+   cpumask_set_cpu(i, _cache->cluster_cpus);
+   }
+
+   of_node_put(cpu);
+   }


We already have what's probably similar code to parse DT and populate
cacheinfo data. Does that not work for you? If not, why not and can we
extend it?


As I saw in cacheinfo it will return the cacheinfo for the online CPUs 
only, correct me if I'm wrong..


Here I'm parsing all the L2 info for all CPUs depends on DT to get 
"cluster_cpus", and using smp_call_function_any() will call the online 
cpu to read the L2MERRSR status register.




Then your driver might work if the data comes from ACPI instead (or
maybe that's all different, I don't know).


No plan to get it work on ACPI, at least in the near future.



Rob



Re: [PATCH v6 3/3] edac: Add support for Amazon's Annapurna Labs L2 EDAC

2019-10-10 Thread Hawa, Hanna

Hi,


On 10/10/2019 2:19 AM, Rob Herring wrote:

On Mon, Oct 7, 2019 at 10:18 AM Hanna Hawa  wrote:


Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and
report L2 errors.

Signed-off-by: Hanna Hawa 
---
  MAINTAINERS   |   5 +
  drivers/edac/Kconfig  |   8 ++
  drivers/edac/Makefile |   1 +
  drivers/edac/al_l2_edac.c | 251 ++
  4 files changed, 265 insertions(+)
  create mode 100644 drivers/edac/al_l2_edac.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7887a62dc843..0eabcfcf91a9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -748,6 +748,11 @@ M: Hanna Hawa 
  S: Maintained
  F: drivers/edac/al_l1_edac.c

+AMAZON ANNAPURNA LABS L2 EDAC
+M: Hanna Hawa 
+S: Maintained
+F: drivers/edac/al_l2_edac.c
+
  AMAZON ANNAPURNA LABS THERMAL MMIO DRIVER
  M: Talel Shenhar 
  S: Maintained
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index e8161d7c7469..cb394aff1cab 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -82,6 +82,14 @@ config EDAC_AL_L1
   for Amazon's Annapurna Labs SoCs.
   This driver detects errors of L1 caches.

+config EDAC_AL_L2
+   tristate "Amazon's Annapurna Labs L2 EDAC"


I still think this should be an "A57 L2 ECC" driver, but if no one
cares I'll shut up and the 2nd person can rename everything.





+   depends on ARCH_ALPINE


|| COMPILE_TEST


Will be add in next patchset.



Maybe it needs an ARM64 dependency too in this case?


Yes, it need ARM64 dependency, I'll add.

Thanks,
Hanna




+   help
+ Support for L2 error detection and correction
+ for Amazon's Annapurna Labs SoCs.
+ This driver detects errors of L2 caches.
+



+
+   ret = platform_driver_register(_l2_edac_driver);
+   if (ret) {
+   pr_err("Failed to register %s (%d)\n", DRV_NAME, ret);
+   return ret;
+   }
+
+   edac_l2_device = platform_device_register_simple(DRV_NAME, -1, NULL, 0);
+   if (IS_ERR(edac_l2_device)) {
+   pr_err("Failed to register EDAC AL L2 platform device\n");
+   return PTR_ERR(edac_l2_device);
+   }
+
+   return 0;
+}
+
+static void __exit al_l2_exit(void)
+{
+   platform_device_unregister(edac_l2_device);
+   platform_driver_unregister(_l2_edac_driver);
+}
+
+late_initcall(al_l2_init);
+module_exit(al_l2_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Hanna Hawa ");
+MODULE_DESCRIPTION("Amazon's Annapurna Lab's L2 EDAC Driver");
--
2.17.1



Re: [PATCH v4 1/2] edac: Add an API for edac device to report for multiple errors

2019-10-02 Thread Hawa, Hanna




On 9/30/2019 5:50 PM, Borislav Petkov wrote:

On Mon, Sep 23, 2019 at 08:17:40PM +0100, Hanna Hawa wrote:

+void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,
+   int inst_nr, int block_nr, const char *msg)
+{
+   __edac_device_handle_ce(edac_dev, 1, inst_nr, block_nr, msg);
+}
+EXPORT_SYMBOL_GPL(edac_device_handle_ce);


Eww, I don't like that: exporting the function *and* the __ counterpart.
The user will get confused and that is unnecessary.

See below for a better version. This way you solve the whole deal with a
single patch.

I'm okay with this version, minor comment below.



---
From: Hanna Hawa 
Date: Mon, 23 Sep 2019 20:17:40 +0100
Subject: [PATCH] EDAC/device: Rework error logging API

Make the main workhorse the "count" functions which can log a @count
of errors. Have the current APIs edac_device_handle_{ce,ue}() call
the _count() variants and this way keep the exported symbols number
unchanged.

  [ bp: Rewrite. ]

Signed-off-by: Hanna Hawa 
Signed-off-by: Borislav Petkov 
Cc: b...@amazon.com
Cc: d...@amazon.co.uk
Cc: hano...@amazon.com
Cc: James Morse 
Cc: jon...@amazon.com
Cc: linux-edac 
Cc: Mauro Carvalho Chehab 
Cc: ron...@amazon.com
Cc: ta...@amazon.com
Cc: Tony Luck 
Link: https://lkml.kernel.org/r/20190923191741.29322-2-hhh...@amazon.com
---
  drivers/edac/edac_device.c | 44 ---
  drivers/edac/edac_device.h | 54 ++
  2 files changed, 66 insertions(+), 32 deletions(-)

diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
index 65cf2b9355c4..d4d8bed5b55d 100644
--- a/drivers/edac/edac_device.c
+++ b/drivers/edac/edac_device.c
@@ -555,8 +555,9 @@ static inline int edac_device_get_panic_on_ue(struct 
edac_device_ctl_info
return edac_dev->panic_on_ue;
  }
  
-void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,

-   int inst_nr, int block_nr, const char *msg)
+void edac_device_handle_ce_count(struct edac_device_ctl_info *edac_dev,
+unsigned int count, int inst_nr, int block_nr,
+const char *msg)
  {
struct edac_device_instance *instance;
struct edac_device_block *block = NULL;


Missing count check, same in *_ue_count():
if (count)
return;


@@ -582,23 +583,24 @@ void edac_device_handle_ce(struct edac_device_ctl_info 
*edac_dev,
  
  	if (instance->nr_blocks > 0) {

block = instance->blocks + block_nr;
-   block->counters.ce_count++;
+   block->counters.ce_count += count;
}
  
  	/* Propagate the count up the 'totals' tree */

-   instance->counters.ce_count++;
-   edac_dev->counters.ce_count++;
+   instance->counters.ce_count += count;
+   edac_dev->counters.ce_count += count;
  
  	if (edac_device_get_log_ce(edac_dev))

edac_device_printk(edac_dev, KERN_WARNING,
-   "CE: %s instance: %s block: %s '%s'\n",
-   edac_dev->ctl_name, instance->name,
-   block ? block->name : "N/A", msg);
+  "CE: %s instance: %s block: %s count: %d 
'%s'\n",
+  edac_dev->ctl_name, instance->name,
+  block ? block->name : "N/A", count, msg);
  }
-EXPORT_SYMBOL_GPL(edac_device_handle_ce);
+EXPORT_SYMBOL_GPL(edac_device_handle_ce_count);
  
-void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev,

-   int inst_nr, int block_nr, const char *msg)
+void edac_device_handle_ue_count(struct edac_device_ctl_info *edac_dev,
+unsigned int count, int inst_nr, int block_nr,
+const char *msg)
  {
struct edac_device_instance *instance;
struct edac_device_block *block = NULL;
@@ -624,22 +626,22 @@ void edac_device_handle_ue(struct edac_device_ctl_info 
*edac_dev,
  
  	if (instance->nr_blocks > 0) {

block = instance->blocks + block_nr;
-   block->counters.ue_count++;
+   block->counters.ue_count += count;
}
  
  	/* Propagate the count up the 'totals' tree */

-   instance->counters.ue_count++;
-   edac_dev->counters.ue_count++;
+   instance->counters.ue_count += count;
+   edac_dev->counters.ue_count += count;
  
  	if (edac_device_get_log_ue(edac_dev))

edac_device_printk(edac_dev, KERN_EMERG,
-   "UE: %s instance: %s block: %s '%s'\n",
-   edac_dev->ctl_name, instance->name,
-   block ? block->name : "N/A", msg);
+  "UE: %s instance: %s block: %s count: %d 
'%s'\n",
+  edac_dev->ctl_name, instance->name,
+  block ? block->name : "N/A", count, msg);
  
  	if 

Re: [PATCH v3 1/2] edac: Add an API for edac device to report for multiple errors

2019-09-23 Thread Hawa, Hanna




On 9/20/2019 9:42 AM, Robert Richter wrote:

On 19.09.19 18:17:12, Hanna Hawa wrote:

Add an API for EDAC device to report multiple errors with same type.

Signed-off-by: Hanna Hawa 


With the change below it looks good to me:

Acked-by: Robert Richter 


Thanks



Thanks,

-Robert


---
  drivers/edac/edac_device.c | 62 ++
  drivers/edac/edac_device.h | 40 
  2 files changed, 82 insertions(+), 20 deletions(-)

diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
index 65cf2b9355c4..866934f2bcb0 100644
--- a/drivers/edac/edac_device.c
+++ b/drivers/edac/edac_device.c
@@ -555,12 +555,16 @@ static inline int edac_device_get_panic_on_ue(struct 
edac_device_ctl_info
return edac_dev->panic_on_ue;
  }
  
-void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,

-   int inst_nr, int block_nr, const char *msg)
+void __edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,
+unsigned int count, int inst_nr, int block_nr,
+const char *msg)
  {
struct edac_device_instance *instance;
struct edac_device_block *block = NULL;
  
+	if (!count)

+   return;
+


Those checks should be moved to the *_count() variants of both
functions.


Will be moved to the inline functions.



[...]


+static inline void edac_device_handle_ce_count(struct edac_device_ctl_info 
*edac_dev,
+  unsigned int count, int inst_nr,
+  int block_nr, const char *msg)
+{


if (count)
...


+   __edac_device_handle_ce(edac_dev, count, inst_nr, block_nr, msg);
+}
+
+static inline void edac_device_handle_ue_count(struct edac_device_ctl_info 
*edac_dev,
+  unsigned int count, int inst_nr,
+  int block_nr, const char *msg)
+{


Here too.


+   __edac_device_handle_ue(edac_dev, count, inst_nr, block_nr, msg);
+}


Thanks,
Hanna


Re: [PATCH v3 2/2] edac: move edac_device_handle_*() API functions to header

2019-09-23 Thread Hawa, Hanna




On 9/20/2019 9:49 AM, Robert Richter wrote:

On 19.09.19 18:17:13, Hanna Hawa wrote:

Move edac_device_handle_*() functions from source file to header file as
inline funtcion that use the new API with single error.

Signed-off-by: Hanna Hawa 


With the changes below it looks good to me:

Acked-by: Robert Richter 

Thanks,

-Robert


diff --git a/drivers/edac/edac_device.h b/drivers/edac/edac_device.h
index 30dc5f5979c8..796ea134a691 100644
--- a/drivers/edac/edac_device.h
+++ b/drivers/edac/edac_device.h
@@ -285,29 +285,6 @@ extern int edac_device_add_device(struct 
edac_device_ctl_info *edac_dev);
   */
  extern struct edac_device_ctl_info *edac_device_del_device(struct device 
*dev);
  
-/**

- * edac_device_handle_ue():
- * perform a common output and handling of an 'edac_dev' UE event
- *
- * @edac_dev: pointer to struct _device_ctl_info
- * @inst_nr: number of the instance where the UE error happened
- * @block_nr: number of the block where the UE error happened
- * @msg: message to be printed
- */
-extern void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev,
-   int inst_nr, int block_nr, const char *msg);
-/**
- * edac_device_handle_ce():
- * perform a common output and handling of an 'edac_dev' CE event
- *
- * @edac_dev: pointer to struct _device_ctl_info
- * @inst_nr: number of the instance where the CE error happened
- * @block_nr: number of the block where the CE error happened
- * @msg: message to be printed
- */
-extern void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,
-   int inst_nr, int block_nr, const char *msg);
-


Just put in the inline replacement here.


I'll re-arrange the functions in patches 1/2 and put the 
*edac_device_handle_* functions here instead of end of file.





  /**
   * edac_device_alloc_index: Allocate a unique device index number
   *
@@ -357,4 +334,18 @@ static inline void edac_device_handle_ue_count(struct 
edac_device_ctl_info *edac
  {
__edac_device_handle_ue(edac_dev, count, inst_nr, block_nr, msg);
  }
+
+static inline void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,
+int inst_nr, int block_nr,


No need for this linebreak.


It'll be more than 80 characters.




+const char *msg)
+{
+   __edac_device_handle_ce(edac_dev, 1, inst_nr, block_nr, msg);
+}
+
+static inline void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev,
+int inst_nr, int block_nr,


Same here.


+const char *msg)
+{
+   __edac_device_handle_ue(edac_dev, 1, inst_nr, block_nr, msg);
+}
  #endif
--
2.17.1



Re: [PATCH v2 1/2] edac: Add an API for edac device to report for multiple errors

2019-09-19 Thread Hawa, Hanna




On 9/19/2019 9:33 AM, Robert Richter wrote:

On 12.09.19 15:53:04, Hanna Hawa wrote:

Add an API for EDAC device to report multiple errors with same type.

Signed-off-by: Hanna Hawa 
---
  drivers/edac/edac_device.c | 91 ++
  drivers/edac/edac_device.h | 40 +
  2 files changed, 131 insertions(+)

diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
index 65cf2b9355c4..78ac44103acc 100644
--- a/drivers/edac/edac_device.c
+++ b/drivers/edac/edac_device.c
@@ -643,3 +643,94 @@ void edac_device_handle_ue(struct edac_device_ctl_info 
*edac_dev,
block ? block->name : "N/A", msg);
  }
  EXPORT_SYMBOL_GPL(edac_device_handle_ue);
+
+void __edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,
+unsigned int count, int inst_nr, int block_nr,
+const char *msg)
+{


Please do not add a copy here, instead modify the existing function
and share the code with both, old and new functions.


Will be fixed.



Thanks,

-Robert



Re: [PATCH 1/1] edac: Add an API for edac device to report for multiple errors

2019-09-10 Thread Hawa, Hanna

Hi

On 9/8/2019 11:35 AM, Borislav Petkov wrote:

On Sun, Sep 08, 2019 at 10:16:02AM +0200, Borislav Petkov wrote:

On Sun, Sep 08, 2019 at 10:58:31AM +0300, Hawa, Hanna wrote:

Better use WARN_ON_ONCE() to avoid flooding.


In case of two drivers using this function with wrong error count, only the
first WARN_ON_ONCE will catch in this case, and other will miss other wrong
usage of other edac device drivers.


The idea is to catch any driver using a 0 error count and fix it, not to
flood dmesg. You want _ONCE.


... and you want to return early too, i.e.,

if (WARN_ON_ONCE(!error_count))
return;

Frankly, I'd even remove all the warning functionality and simply do

if (!error_count)
return;


I'll keep it simple as you suggest and remove the warning functionality.



but let's see how much it screams first.



Thanks,
Hanna


Re: [PATCH 1/1] edac: Add an API for edac device to report for multiple errors

2019-09-08 Thread Hawa, Hanna




On 9/5/2019 12:56 PM, Robert Richter wrote:

Hi Hanna,

thanks for the update. See below.

On 05.09.19 09:37:45, Hanna Hawa wrote:

Add an API for edac device to report multiple errors with same type.

Signed-off-by: Hanna Hawa 
---
  drivers/edac/edac_device.c | 66 +-
  drivers/edac/edac_device.h | 31 --
  2 files changed, 79 insertions(+), 18 deletions(-)

diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
index 65cf2b9355c4..bf6a4fd9831b 100644
--- a/drivers/edac/edac_device.c
+++ b/drivers/edac/edac_device.c
@@ -555,12 +555,15 @@ static inline int edac_device_get_panic_on_ue(struct 
edac_device_ctl_info
return edac_dev->panic_on_ue;
  }
  
-void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,

-   int inst_nr, int block_nr, const char *msg)
+static void __edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,
+  u16 error_count, int inst_nr, int block_nr,


Just curious, why u16, some register mask size? Maybe just use unsigned int?


Wanted to be aligned with edac MC.
I can change it to be u32.



I think the variable can be shortened to 'count', the meaning should
still be clear.


I think more clear to include 'error'.
maybe shorter name 'err_count'?




+  const char *msg)
  {
struct edac_device_instance *instance;
struct edac_device_block *block = NULL;
  
+	WARN_ON(!error_count);


Should return in this case.

Better use WARN_ON_ONCE() to avoid flooding.


In case of two drivers using this function with wrong error count, only 
the first WARN_ON_ONCE will catch in this case, and other will miss 
other wrong usage of other edac device drivers.




The check should be moved to edac_device_handle_ce_count().


I'll move it to edac_device_handle_ce_count.




+
if ((inst_nr >= edac_dev->nr_instances) || (inst_nr < 0)) {
edac_device_printk(edac_dev, KERN_ERR,
"INTERNAL ERROR: 'instance' out of range "
@@ -582,27 +585,44 @@ void edac_device_handle_ce(struct edac_device_ctl_info 
*edac_dev,
  
  	if (instance->nr_blocks > 0) {

block = instance->blocks + block_nr;
-   block->counters.ce_count++;
+   block->counters.ce_count += error_count;
}
  
  	/* Propagate the count up the 'totals' tree */

-   instance->counters.ce_count++;
-   edac_dev->counters.ce_count++;
+   instance->counters.ce_count += error_count;
+   edac_dev->counters.ce_count += error_count;
  
  	if (edac_device_get_log_ce(edac_dev))

edac_device_printk(edac_dev, KERN_WARNING,
-   "CE: %s instance: %s block: %s '%s'\n",
+   "CE: %s instance: %s block: %s count: %d 
'%s'\n",
edac_dev->ctl_name, instance->name,
-   block ? block->name : "N/A", msg);
+   block ? block->name : "N/A", error_count, msg);
+}
+
+void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,
+  int inst_nr, int block_nr, const char *msg)
+{
+   __edac_device_handle_ce(edac_dev, 1, inst_nr, block_nr, msg);
  }
  EXPORT_SYMBOL_GPL(edac_device_handle_ce);


We could just export the __*() version of those functions and make
everything else inline in the header file? Though, better do this with
two patches to avoid an ABI breakage in case someone wants to backport
it. Let's see what others say here.


Waiting for other reviewers.



  
-void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev,

-   int inst_nr, int block_nr, const char *msg)
+void edac_device_handle_ce_count(struct edac_device_ctl_info *edac_dev,
+u16 error_count, int inst_nr, int block_nr,
+const char *msg)
+{
+   __edac_device_handle_ce(edac_dev, error_count, inst_nr, block_nr, msg);
+}
+EXPORT_SYMBOL_GPL(edac_device_handle_ce_count);
+
+static void __edac_device_handle_ue(struct edac_device_ctl_info *edac_dev,
+  u16 error_count, int inst_nr, int block_nr,
+  const char *msg)


All the above applies for this function too.


  {
struct edac_device_instance *instance;
struct edac_device_block *block = NULL;
  
+	WARN_ON(!error_count);

+
if ((inst_nr >= edac_dev->nr_instances) || (inst_nr < 0)) {
edac_device_printk(edac_dev, KERN_ERR,
"INTERNAL ERROR: 'instance' out of range "
@@ -624,22 +644,36 @@ void edac_device_handle_ue(struct edac_device_ctl_info 
*edac_dev,
  
  	if (instance->nr_blocks > 0) {

block = instance->blocks + block_nr;
-   block->counters.ue_count++;
+   block->counters.ue_count += error_count;
}
  
  	/* Propagate the count up the 

Re: [PATCH v3 4/4] edac: Add support for Amazon's Annapurna Labs L2 EDAC

2019-09-03 Thread Hawa, Hanna




On 9/3/2019 10:27 AM, Robert Richter wrote:

On 15.07.19 16:24:09, Hanna Hawa wrote:

Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and
report L2 errors.

Signed-off-by: Hanna Hawa 
---
  MAINTAINERS   |   6 ++
  drivers/edac/Kconfig  |   8 ++
  drivers/edac/Makefile |   1 +
  drivers/edac/al_l2_edac.c | 187 ++
  4 files changed, 202 insertions(+)
  create mode 100644 drivers/edac/al_l2_edac.c


 From a brief look at it, it seems some of my comments from 2/4 apply
here too. Please look through it.


Thanks for your review, will look and fix on top of v5.

Thanks,
Hanna



-Robert



Re: [PATCH v3 2/4] edac: Add support for Amazon's Annapurna Labs L1 EDAC

2019-09-03 Thread Hawa, Hanna




On 9/3/2019 10:24 AM, Robert Richter wrote:

On 15.07.19 16:24:07, Hanna Hawa wrote:

Adds support for Amazon's Annapurna Labs L1 EDAC driver to detect and
report L1 errors.

Signed-off-by: Hanna Hawa 
Reviewed-by: James Morse 
---
  MAINTAINERS   |   6 ++
  drivers/edac/Kconfig  |   8 +++
  drivers/edac/Makefile |   1 +
  drivers/edac/al_l1_edac.c | 156 ++
  4 files changed, 171 insertions(+)
  create mode 100644 drivers/edac/al_l1_edac.c



diff --git a/drivers/edac/al_l1_edac.c b/drivers/edac/al_l1_edac.c
new file mode 100644
index 000..70510ea
--- /dev/null
+++ b/drivers/edac/al_l1_edac.c


[...]


+static void al_l1_edac_cpumerrsr(void *arg)


Could this being named to something meaningful, such as
*_read_status() or so?


+{
+   struct edac_device_ctl_info *edac_dev = arg;
+   int cpu, i;
+   u32 ramid, repeat, other, fatal;
+   u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1);
+   char msg[AL_L1_EDAC_MSG_MAX];
+   int space, count;
+   char *p;
+
+   if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val)))
+   return;


[...]


+static void al_l1_edac_check(struct edac_device_ctl_info *edac_dev)
+{
+   on_each_cpu(al_l1_edac_cpumerrsr, edac_dev, 1);
+}
+
+static int al_l1_edac_probe(struct platform_device *pdev)
+{
+   struct edac_device_ctl_info *edac_dev;
+   struct device *dev = >dev;
+   int ret;
+
+   edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L",


This type cast looks broken. dev_name() is a constant string already.

Other drivers do not use the dynamically generated dev_name() string
here, instead a fix string such as mod_name or ctl_name could be used.
edac_device_alloc_ctl_info() later generates a unique instance name
derived from name + index.


Ack, will fix and use DRV_NAME.



Regarding the type, this seems to be an API issue of edac_device_
alloc_ctl_info() that should actually use const char* in its
interface. So if needed (from what I wrote above it is not) the type
in the argument list needs to be changed instead.


+ 1, 1, NULL, 0,
+ edac_device_alloc_index());
+   if (IS_ERR(edac_dev))
+   return -ENOMEM;


Use the original error code instead.


Actually it return NULL in case of failure, it was changed in v5 to 
check if error/NULL.





+
+   edac_dev->edac_check = al_l1_edac_check;
+   edac_dev->dev = dev;
+   edac_dev->mod_name = DRV_NAME;
+   edac_dev->dev_name = dev_name(dev);
+   edac_dev->ctl_name = "L1 cache";


Should not contain spaces and maybe a bit more specific.


L1_cache_ecc_err? or L1_cache_err?




+   platform_set_drvdata(pdev, edac_dev);
+
+   ret = edac_device_add_device(edac_dev);
+   if (ret) {
+   dev_err(dev, "Failed to add L1 edac device\n");


Move this printk below to the error path and maybe print the error
code. You do not cover the -ENOMEM failure.


Ack.



-Robert


+   goto err;
+   }
+
+   return 0;
+err:
+   edac_device_free_ctl_info(edac_dev);
+
+   return ret;
+}


Re: [PATCH v5 1/4] dt-bindings: EDAC: Add Amazon's Annapurna Labs L1 EDAC

2019-08-26 Thread Hawa, Hanna




On 8/21/2019 10:17 PM, Rob Herring wrote:

Why is this even in DT? AFAICT, this is all just CortexA57 core features
(i.e. nothing Amazon specific). The core type and the ECC capabilities
are discoverable.


Added to the DT in order to easily enable/disable the driver. You are 
correct that they are CortexA57 core features and nothing Amazon 
specific, but it's IMPLEMENTATION DEFINED, meaning that in different 
cortex revisions (e.g. A57) the register bitmap may change. Because of 
that we added an Amazon compatible which corresponds to the specific 
core we are using.


Thanks,
Hanna



Rob


Re: [PATCH v4 4/4] edac: Add support for Amazon's Annapurna Labs L2 EDAC

2019-08-05 Thread Hawa, Hanna




On 8/2/2019 6:11 PM, James Morse wrote:

Hi Hanna,

On 01/08/2019 14:09, Hanna Hawa wrote:

Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and
report L2 errors.
diff --git a/drivers/edac/al_l2_edac.c b/drivers/edac/al_l2_edac.c
new file mode 100644
index ..6c6d37cf82ab
--- /dev/null
+++ b/drivers/edac/al_l2_edac.c
@@ -0,0 +1,189 @@



+#include 
+#include 


#include  ?


Will be added.




+#include 
+#include 


[...]


+static void al_l2_edac_l2merrsr(void *arg)
+{
+   struct edac_device_ctl_info *edac_dev = arg;
+   int cpu, i;
+   u32 ramid, repeat, other, fatal;
+   u64 val = read_sysreg_s(ARM_CA57_L2MERRSR_EL1);
+   char msg[AL_L2_EDAC_MSG_MAX];
+   int space, count;
+   char *p;
+
+   if (!(FIELD_GET(ARM_CA57_L2MERRSR_VALID, val)))
+   return;
+
+   write_sysreg_s(0, ARM_CA57_L2MERRSR_EL1);
+
+   cpu = smp_processor_id();
+   ramid = FIELD_GET(ARM_CA57_L2MERRSR_RAMID, val);
+   repeat = FIELD_GET(ARM_CA57_L2MERRSR_REPEAT, val);
+   other = FIELD_GET(ARM_CA57_L2MERRSR_OTHER, val);
+   fatal = FIELD_GET(ARM_CA57_L2MERRSR_FATAL, val);
+
+   space = sizeof(msg);
+   p = msg;
+   count = scnprintf(p, space, "CPU%d L2 %serror detected", cpu,
+ (fatal) ? "Fatal " : "");
+   p += count;
+   space -= count;
+
+   switch (ramid) {
+   case ARM_CA57_L2_TAG_RAM:
+   count = scnprintf(p, space, " RAMID='L2 Tag RAM'");
+   break;
+   case ARM_CA57_L2_DATA_RAM:
+   count = scnprintf(p, space, " RAMID='L2 Data RAM'");
+   break;
+   case ARM_CA57_L2_SNOOP_RAM:
+   count = scnprintf(p, space, " RAMID='L2 Snoop RAM'");


Nit: The TRMs both call this 'L2 Snoop Tag RAM'. Could we include 'tag' in the
description. 'tag' implies its some kind of metadata, so an uncorrected error 
here affect
a now unknown location, its more series than a 'data RAM' error. v8.2 would 
term this kind
of error 'uncontained'.


Ack, will be fixed.





+   break;
+   case ARM_CA57_L2_DIRTY_RAM:
+   count = scnprintf(p, space, " RAMID='L2 Dirty RAM'");
+   break;
+   case ARM_CA57_L2_INC_PF_RAM:
+   count = scnprintf(p, space, " RAMID='L2 internal metadat'");


Nit: metadata


Ack, will be fixed.




+   break;
+   default:
+   count = scnprintf(p, space, " RAMID='unknown'");
+   break;
+   }
+
+   p += count;
+   space -= count;
+
+   count = scnprintf(p, space,
+ " repeat=%d, other=%d (L2MERRSR_EL1=0x%llx)",
+ repeat, other, val);
+
+   for (i = 0; i < repeat; i++) {
+   if (fatal)
+   edac_device_handle_ue(edac_dev, 0, 0, msg);
+   else
+   edac_device_handle_ce(edac_dev, 0, 0, msg);
+   }
+}


[...]


+static int al_l2_edac_probe(struct platform_device *pdev)
+{
+   struct edac_device_ctl_info *edac_dev;
+   struct al_l2_edac *al_l2;
+   struct device *dev = >dev;
+   int ret, i;
+
+   edac_dev = edac_device_alloc_ctl_info(sizeof(*al_l2),
+ (char *)dev_name(dev), 1, "L", 1,
+ 2, NULL, 0,
+ edac_device_alloc_index());
+   if (IS_ERR_OR_NULL(edac_dev))
+   return -ENOMEM;
+
+   al_l2 = edac_dev->pvt_info;
+   edac_dev->edac_check = al_l2_edac_check;
+   edac_dev->dev = dev;
+   edac_dev->mod_name = DRV_NAME;
+   edac_dev->dev_name = dev_name(dev);
+   edac_dev->ctl_name = "L2 cache";
+   platform_set_drvdata(pdev, edac_dev);



+   for_each_online_cpu(i) {


for_each_possible_cpu()?

If you boot with maxcpus= the driver's behaviour changes.
But you are only parsing information from the DT, so you don't really need the 
CPUs to be
online.
Agree, for dt parsing no need the online CPUs, if 
for_each_possible_cpu() used then smp_call_function_any() will run only 
on the online CPUs in the mask.


Will change to for_each_possible_cpu().





+   struct device_node *cpu;
+   struct device_node *cpu_cache, *l2_cache;
+
+   cpu = of_get_cpu_node(i, NULL);


(of_get_cpu_node() can return NULL, but I don't think it can ever happen like 
this)


+   cpu_cache = of_find_next_cache_node(cpu);
+   l2_cache = of_parse_phandle(dev->of_node, "l2-cache", 0);
+
+   if (cpu_cache == l2_cache)
+   cpumask_set_cpu(i, _l2->cluster_cpus);


You need to of_node_put() these device_node pointers once you're done with them.


Will be added.





+   }
+
+   if (cpumask_empty(_l2->cluster_cpus)) {
+   dev_err(dev, "CPU mask is empty for this L2 cache\n");
+   ret = -EINVAL;
+   goto err;
+   

Re: [RFC 1/1] edac: Add a counter parameter for edac_device_handle_ue/ce()

2019-08-01 Thread Hawa, Hanna




On 8/1/2019 5:17 PM, Robert Richter wrote:

Don't you think it'll be confused to have different APIs between EDAC_MC and
EDAC_DEVICE?
(in MC the count passed as part of edac_mc_handle_error())

I don't think edac_mc_handle_error() with 11 function arguments is a
good reference for somethin we want to adopt. For the majority of
drivers you just introduce another useless argument with the following
pattern:

edac_device_handle_ce(edac_dev, 1, 0, 0, edac_dev_name);

IMO, the api should be improved when touching it.


Got it, I'll update the patch as you suggested.

Thanks,
Hanna



-Robert


Re: [RFC 1/1] edac: Add a counter parameter for edac_device_handle_ue/ce()

2019-08-01 Thread Hawa, Hanna




On 8/1/2019 2:35 PM, Robert Richter wrote:

On 15.07.19 13:53:07, Hanna Hawa wrote:

Add a counter parameter in order to avoid losing errors count for edac
device, the error count reports the number of errors reported by an edac
device similar to the way MC_EDAC do.

Signed-off-by: Hanna Hawa 
---
  drivers/edac/altera_edac.c  | 20 
  drivers/edac/amd8111_edac.c |  6 +++---
  drivers/edac/cpc925_edac.c  |  4 ++--
  drivers/edac/edac_device.c  | 18 ++
  drivers/edac/edac_device.h  |  8 ++--
  drivers/edac/highbank_l2_edac.c |  4 ++--
  drivers/edac/mpc85xx_edac.c |  4 ++--
  drivers/edac/mv64x60_edac.c |  4 ++--
  drivers/edac/octeon_edac-l2c.c  | 20 ++--
  drivers/edac/octeon_edac-pc.c   |  6 +++---
  drivers/edac/qcom_edac.c|  8 
  drivers/edac/thunderx_edac.c| 10 +-
  drivers/edac/xgene_edac.c   | 26 +-
  13 files changed, 74 insertions(+), 64 deletions(-)



diff --git a/drivers/edac/edac_device.h b/drivers/edac/edac_device.h
index 1aaba74..cf1a1da 100644
--- a/drivers/edac/edac_device.h
+++ b/drivers/edac/edac_device.h
@@ -290,23 +290,27 @@ extern struct edac_device_ctl_info 
*edac_device_del_device(struct device *dev);
   *perform a common output and handling of an 'edac_dev' UE event
   *
   * @edac_dev: pointer to struct _device_ctl_info
+ * @error_count: number of errors of the same type
   * @inst_nr: number of the instance where the UE error happened
   * @block_nr: number of the block where the UE error happened
   * @msg: message to be printed
   */
  extern void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev,
-   int inst_nr, int block_nr, const char *msg);
+ u16 error_count, int inst_nr, int block_nr,
+ const char *msg);
  /**
   * edac_device_handle_ce():
   *perform a common output and handling of an 'edac_dev' CE event
   *
   * @edac_dev: pointer to struct _device_ctl_info
+ * @error_count: number of errors of the same type
   * @inst_nr: number of the instance where the CE error happened
   * @block_nr: number of the block where the CE error happened
   * @msg: message to be printed
   */
  extern void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,


How about renaming this to __edac_device_handle_ce() and then have 2
macros for:

  * edac_device_handle_ce() to keep old i/f.

  * edac_device_handle_ce_count(), with count parameter added.

Same for uncorrectable errors.

Code of other driver can be kept as it is then.


Don't you think it'll be confused to have different APIs between EDAC_MC 
and EDAC_DEVICE?

(in MC the count passed as part of edac_mc_handle_error())

I don't have strong objection, the change for other drivers is not that 
hard.




Thanks,

-Robert


-   int inst_nr, int block_nr, const char *msg);
+ u16 error_count, int inst_nr, int block_nr,
+ const char *msg);


Re: [PATCH v3 2/4] edac: Add support for Amazon's Annapurna Labs L1 EDAC

2019-08-01 Thread Hawa, Hanna



On 7/26/2019 7:49 PM, James Morse wrote:

Hi Hanna,

On 15/07/2019 14:24, Hanna Hawa wrote:

Adds support for Amazon's Annapurna Labs L1 EDAC driver to detect and
report L1 errors.



diff --git a/drivers/edac/al_l1_edac.c b/drivers/edac/al_l1_edac.c
new file mode 100644
index 000..70510ea
--- /dev/null
+++ b/drivers/edac/al_l1_edac.c
@@ -0,0 +1,156 @@



+#include 


You need  for on-each_cpu().


+#include "edac_device.h"
+#include "edac_module.h"


You need  for the sys_reg() macro. The ARCH_ALPINE dependency 
doesn't stop
this from being built on 32bit arm, where this sys_reg() won't work/exist.


Will fix.



[...]


+static void al_l1_edac_cpumerrsr(void *arg)
+{
+   struct edac_device_ctl_info *edac_dev = arg;
+   int cpu, i;
+   u32 ramid, repeat, other, fatal;
+   u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1);
+   char msg[AL_L1_EDAC_MSG_MAX];
+   int space, count;
+   char *p;
+   if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val)))
+   return;
+   space = sizeof(msg);
+   p = msg;
+   count = snprintf(p, space, "CPU%d L1 %serror detected", cpu,
+(fatal) ? "Fatal " : "");
+   p += count;
+   space -= count;


snprintf() will return the number of characters it would have generated, even 
if that is
more than space. If this happen, space becomes negative, p points outside msg[] 
and msg[]
isn't NULL terminated...

It looks like you want scnprintf(), which returns the number of characters 
written to buf
instead. (I don't see how 256 characters would be printed by this code)


Will use scnprintf() instead.





+   switch (ramid) {
+   case ARM_CA57_L1_I_TAG_RAM:
+   count = snprintf(p, space, " RAMID='L1-I Tag RAM'");
+   break;
+   case ARM_CA57_L1_I_DATA_RAM:
+   count = snprintf(p, space, " RAMID='L1-I Data RAM'");
+   break;
+   case ARM_CA57_L1_D_TAG_RAM:
+   count = snprintf(p, space, " RAMID='L1-D Tag RAM'");
+   break;
+   case ARM_CA57_L1_D_DATA_RAM:
+   count = snprintf(p, space, " RAMID='L1-D Data RAM'");
+   break;
+   case ARM_CA57_L2_TLB_RAM:
+   count = snprintf(p, space, " RAMID='L2 TLB RAM'");
+   break;
+   default:
+   count = snprintf(p, space, " RAMID='unknown'");
+   break;
+   }
+
+   p += count;
+   space -= count;
+   count = snprintf(p, space,
+" repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)",
+repeat, other, val);
+
+   for (i = 0; i < repeat; i++) {
+   if (fatal)
+   edac_device_handle_ue(edac_dev, 0, 0, msg);
+   else
+   edac_device_handle_ce(edac_dev, 0, 0, msg);
+   }
+
+   write_sysreg_s(0, ARM_CA57_CPUMERRSR_EL1);


Writing 0 just after you've read the value would minimise the window where 
repeat could
have increased behind your back, or another error was counted as other, when it 
could have
been reported more accurately.


Good point, I will move the write after checking the valid bit.





+}




+static int al_l1_edac_probe(struct platform_device *pdev)
+{
+   struct edac_device_ctl_info *edac_dev;
+   struct device *dev = >dev;
+   int ret;
+
+   edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L",
+ 1, 1, NULL, 0,
+ edac_device_alloc_index());
+   if (IS_ERR(edac_dev))


edac_device_alloc_ctl_info() returns NULL, or dev_ctl, which comes from 
kzalloc(). I think
you need to check for NULL here, IS_ERR() only catches the -errno range. (there 
is an
IS_ERR_OR_NULL() if you really needed both)


Will fix.





+   return -ENOMEM;



With the header-includes and edac_device_alloc_ctl_info() NULL check:
Reviewed-by: James Morse 


Thanks James.

Thanks,
Hanna



Thanks,

James



Re: [UNVERIFIED SENDER] Re: [RFC 1/1] edac: Add a counter parameter for edac_device_handle_ue/ce()

2019-07-28 Thread Hawa, Hanna




On 7/25/2019 9:36 PM, Mauro Carvalho Chehab wrote:

/* Propagate the count up the 'totals' tree */
-   instance->counters.ue_count++;
-   edac_dev->counters.ue_count++;
+   instance->counters.ue_count += error_count;
+   edac_dev->counters.ue_count += error_count;

Patch itself looks a good idea, but maybe it should rise a WARN()
if error_count == 0.
Good point, shouldn't we use WARN_ONCE here? if the user call 
edac_device_handle_ue() with error count == 0, it not be change in 
run-time, only if the error count parameter is calculated somehow, and 
it'll be the *caller* issue that didn't check the error count.

What you think?



That applies for both CE and UE error logic.

Sure.

Thanks,
Hanna


Thanks,
Mauro




Re: [RFC 1/1] edac: Add a counter parameter for edac_device_handle_ue/ce()

2019-07-17 Thread Hawa, Hanna

Hi Jan,

On 7/17/2019 3:06 PM, Jan Glauber wrote:

Hi Hanna,

I'm probably missing something but this patch looks like while it adds
the error_count parameter the passed values all seem to be 1. So is the
new parameter used otherwise, maybe in another patch?


Yes in another patch. In Amazon L1/L2 edac driver [1], I'm using loop to 
report on multiple L1 or L2 errors. After this patch I'll remove the 
loop and pass the errors count.


[1]: https://lkml.org/lkml/2019/7/15/349

Thanks,
Hanna


thanks,
Jan




Re: [PATCH v2 2/4] edac: Add support for Amazon's Annapurna Labs L1 EDAC

2019-07-09 Thread Hawa, Hanna




On 7/9/2019 12:32 PM, Jonathan Cameron wrote:

Signed-off-by: Hanna Hawa

A quick drive by review as I was feeling curious.

Just a couple of trivial queries and observation on the fact it
might be useful to add a few devm managed functions to cut down
on edac driver boilerplate.

Thanks,

Jonathan


+#define ARM_CA57_CPUMERRSR_VALID   GENMASK(31, 31)

For a single bit it's common to use BIT(31) rather than GENMASK to make
it explicit.


Will fix.





+   edac_dev->mod_name = dev_name(dev);

I'd admit I'm not that familiar with edac, but seems odd that a
module name field would have the dev_name.


Will fix when I got more inputs.




+   edac_device_free_ctl_info(edac_dev);

More a passing observation than a suggestion for this driver, but if there was
ever a place where it looked like a couple of devm_ allocation functions would
be useful, this is it;)

edac_dev = devm_device_alloc_ctrl_info(dev, ...)
...
devm_edac_device_add_device(dev, ...)


I agree.
I can implement the devm_* functions in separate patches as this is not 
related to my patches (and not to delay this patches).







Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-17 Thread Hawa, Hanna




+static void al_a57_edac_l2merrsr(void *arg)
+{



+    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");


How do we know this is corrected?



If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to 
uncorrected/corrected. Is
this what you are depending on here?



No - not on this. Reporting all the errors as corrected seems to be bad.

Can i be depends on fatal field?


That is described as "set to 1 on the first memory error that caused a Data 
Abort". I
assume this is one of the parity-error external-aborts.

If the repeat counter shows, say, 2, and fatal is set, you only know that at 
least one of
these errors caused an abort. But it could have been all three. The repeat 
counter only
matches against the RAMID and friends, otherwise the error is counted in 
'other'.

I don't think there is a right thing to do here, (other than increase the 
scrubbing
frequency). As you can only feed one error into edac at a time then:


if (fatal)
 edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
else
 edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");


seems reasonable. You're reporting the most severe, and 'other/repeat' counter 
values just
go missing.

I had print the values of 'other/repeat' to be noticed.





How can L2CTLR_EL1[20] force fatal?


I don't think it can, on a second reading, it looks to be even more complicated 
than I
thought! That bit is described as disabling forwarding of uncorrected data, but 
it looks
like the uncorrected data never actually reaches the other end. (I'm unsure 
what 'flush'
means in this context.)
I was looking for reasons you could 'know' that any reported error was 
corrected. This was
just a bad suggestion!

Is there interrupt for un-correctable error?
Does 'asynchronous errors' in L2 used to report UE?

In case no interrupt, can we use die-notifier subsystem to check if any 
error had occur while system shutdown?



+    cluster = topology_physical_package_id(cpu);


Hmm, I'm not sure cluster==package is guaranteed to be true forever.

If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that. 
Otherwise
pulling out the DT using something like the arch code's parse_cluster().



I rely on that it's alpine SoC specific driver.


... and that the topology code hasn't changed to really know what a package is:
https://lore.kernel.org/lkml/20190529211340.17087-2-atish.pa...@wdc.com/T/#u

As what you really want to know is 'same L2?', and you're holding the 
cpu_read_lock(),
would struct cacheinfo's shared_cpu_map be a better fit?

This would be done by something like a cpu-mask of cache:shared_cpu_map's for 
the L2's
you've visited. It removes the dependency on package==L2, and insulates you 
from the
cpu-numbering not being exactly as you expect.
I'll add dt property that point to L2-cache node (phandle), then it'll 
be easy to create cpu-mask with all cores that point to same l2 cache.


Thanks,
Hanna




Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Hawa, Hanna

Hi Boris,



Yap, I think we're in agreement here. I believe the important question
is whether you need to get error information from multiple sources
together in order to do proper recovery or doing it per error source
suffices.

And I think the actual use cases could/should dictate our
drivers/orchestrators design.

Thus my question how you guys are planning on tying all that error info
the drivers report, into the whole system design?
We have daemon script that collects correctable/uncorrectable errors 
from EDAC sysfs and reports to Amazon service that allow us to take 
action on specific error thresholds.


Thanks,
Hanna




Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Hawa, Hanna

Hi James,



Allowing linux to access these implementation-defined registers has come up 
before:
https://www.spinics.net/lists/kernel/msg2750349.html

It looks like you've navigated most of the issues. Accessing 
implementation-defined
registers is frowned on, but this stuff can't be done generically until v8.2.
Sure, no planning to do this generally for all ARM a57/a72. I'm doing 
this specific for alpine SoCs.




This can't be done on 'all A57/A72' because some platforms may not have been 
integrated to
have error-checking at L1/L2, (1.3 'Features' of [0] says the ECC protection 
for L1 data
cache etc is optional). Even if they did, this stuff needs turning on in 
L2CTLR_EL1.
These implementation-defined registers may be trapped by the hypervisor, I 
assume for your
platform this is linux booted at EL2, so no hypervisor.
In Alpine-v2/Alpine-v3 Bit[21]-"ECC and parity enable" in L2CTRL_EL1 is 
enabled in the firmware.






+#define ARM_CA57_CPUMERRSR_INDEX_OFF   (0)
+#define ARM_CA57_CPUMERRSR_INDEX_MASK  (0x3)


(GENMASK() would make it quicker and easier to compare this with the datasheet)

Will be used in next patchset.





+#define  ARM_CA57_L2_INC_PLRU_RAM  0x18


A57 describes this one as 'PF RAM'...

will be updated.





Linux supports versions of binutils that choke on this syntax.
See the sys_reg() definitions in arm64's asm/sysreg.h that define something you 
can feed
to read_sysreg_s(). It would save having these wrapper functions.

commit 72c583951526 ("arm64: gicv3: Allow GICv3 compilation with older 
binutils") for the
story.
Great, I'll use sys_reg(), read_sysreg_s(), and remove the wrapper 
functions.






| #define ARM_CA57_CPUMERRSR_VALID  BIT(31)
| if (!(val & ARM_CA57_CPUMERRSR_VALID))

would be easier to read, the same goes for 'fatal' as its a single bit.

Will be fixed, here and in al_a57_edac_l2merrsr.





+   edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");


How do we know this was corrected?

6.4.8 "Error Correction Code" has "Double-bit ECC errors set the fatal bit." in 
a
paragraph talking about the L1 memory system.

I'll check fatal field to check if it's uncorrected/corrected.



"L2 Error" ? Copy and paste?

copy/paste mistake, will be fixed.





+   edac_printk(KERN_CRIT, DRV_NAME, "CPU%d L1 %serror detected\n",
+   cpu, (fatal) ? "Fatal " : "");
+   edac_printk(KERN_CRIT, DRV_NAME, "RAMID=");
+
+   switch (ramid) {
+   case ARM_CA57_L1_I_TAG_RAM:
+   pr_cont("'L1-I Tag RAM' index=%d way=%d", index, way);
+   break;
+   case ARM_CA57_L1_I_DATA_RAM:
+   pr_cont("'L1-I Data RAM' index=%d bank= %d", index, way);
+   break;


Is index/way information really useful? I can't replace way-3 on the system, 
nor can I
stop it being used. If its useless, I'd rather we don't bother parsing and 
printing it out.
I'll remove the index/way information, and keep CPUMERRSR_EL1 value 
print (who want this information can parse the value and get the 
index/bank status)






+   pr_cont(", repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)\n", repeat, other,
+   val);


'other' here is another error, but we don't know the ramid.
'repeat' is another error for the same ramid.

Could we still feed this stuff into edac? This would make the counters accurate 
if the
polling frequency isn't quite fast enough.
There is no API in EDAC to increase the counters, I can increase by 
accessing the ce_count/ue_count from 
edac_device_ctl_info/edac_device_instance structures if it's okay..






+static void al_a57_edac_l2merrsr(void *arg)
+{



+   edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");


How do we know this is corrected?

If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to 
uncorrected/corrected. Is
this what you are depending on here?

No - not on this. Reporting all the errors as corrected seems to be bad.

Can i be depends on fatal field?

if (fatal)
edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
else
edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

How can L2CTLR_EL1[20] force fatal?



(it would be good to have a list of integration-time and firmware dependencies 
this driver
has, for the next person who tries to enable it on their system and complains 
it doesn't
work for them)

Where can i add such information?





+   case ARM_CA57_L2_INC_PLRU_RAM:
+   pr_cont("'L2 Inclusion PLRU RAM'");


The A57 TRM describes this as "Inclusion PF RAM", and notes its only in r1p0 or 
later,
(but doesn't say what it is). The A72 TRM describes the same encoding as 
"Inclusion PLRU
RAM", which is something to do with its replacement policy. It has control bits 
that A57's
version doesn't, so these are not the same thing.

Disambiguating A57/A72 here is a load of faff, 'L2 internal metadata' probably 
covers both
cases, but unless these RAMs are replaceable or can be disabled, there isn't 
much point
working 

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Hawa, Hanna

Hi Ben, Boris

On 6/11/2019 8:50 AM, Benjamin Herrenschmidt wrote:


Anyway, let's get back to the specific case of our Amazon platform here
since it's a concrete example.

Hanna, can you give us a reasonably exhaustive list of how many such
"drivers" we'll want in the EDAC subsystem and whether you envision any
coordination requirement between them or not ?
In the near future we plan to push EDAC drivers for L1/L2 and memory 
controller.

There's no common resources/shared data between them.

Thanks,
Hanna


Cheers,
Ben.





Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-06 Thread Hawa, Hanna




On 5/31/2019 8:14 AM, Borislav Petkov wrote:

On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote:

This isn't terribly helpful, there's nothing telling anybody which of
those files corresponds to an ARM SoC :-)


drivers/edac/altera_edac.c is one example.

Also, James and I have a small writeup on how an arm driver should look
like, we just need to polish it up and post it.

James?


That said ...

You really want a single EDAC driver that contains all the stuff for
the caches, the memory controller, etc... ?


Yap.


Disagree. The various drivers don't depend on each other.
I think we should keep the drivers separated as they are distinct and 
independent IP blocks.






Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-03 Thread Hawa, Hanna




On 5/31/2019 4:15 AM, Herrenschmidt, Benjamin wrote:

On Thu, 2019-05-30 at 11:19 -0700, Boris Petkov wrote:

On May 30, 2019 3:15:29 AM PDT, Hanna Hawa  wrote:

Add support for error detection and correction for Amazon's
Annapurna
Labs SoCs for L1/L2 caches.



So this should be a driver for the whole annapurna platform and not
only about the RAS functionality in an IP like the caches. See other
ARM EDAC drivers in drivers/edac/ for an example.


This isn't terribly helpful, there's nothing telling anybody which of
those files corresponds to an ARM SoC :-)

That said ...

You really want a single EDAC driver that contains all the stuff for
the caches, the memory controller, etc... ?

The idea here was to separate the core L1/L2 EDAC from the memory
controller EDAC I think ... Roben, Hanna, can you describe the long run
strategy here ?
Correct our target to separate the L1/L2 EDAC from mc, and to maintain 
both in separate drivers.


Thanks,
Hanna


Cheers,
Ben.



Re: [PATCH 2/7] irqchip/alpine-msi: Update driver license to use SPDX

2019-03-31 Thread Hawa, Hanna




On 3/31/2019 3:46 PM, Mukesh Ojha wrote:


On 3/31/2019 6:04 PM, Hanna Hawa wrote:

Update driver license to be in-line with Linux conventions.

Signed-off-by: Hanna Hawa 



---
  drivers/irqchip/irq-alpine-msi.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/irqchip/irq-alpine-msi.c 
b/drivers/irqchip/irq-alpine-msi.c

index ae2fca7..ec6a606 100644
--- a/drivers/irqchip/irq-alpine-msi.c
+++ b/drivers/irqchip/irq-alpine-msi.c
@@ -1,13 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0



Please fix this as per.
https://lkml.org/lkml/2019/2/13/570

I'll fix in next patch-set, I'll wait for more inputs on other patches 
in the series.


Thanks,
Hanna


cheers,
Mukesh


  /*
   * Annapurna Labs MSIX support services
   *
   * Copyright (C) 2016, Amazon.com, Inc. or its affiliates. All 
Rights Reserved.

   *
   * Antoine Tenart 
- *
- * This file is licensed under the terms of the GNU General Public
- * License version 2. This program is licensed "as is" without any
- * warranty of any kind, whether express or implied.
   */
  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt