Re: [PATCH v4 0/3] Fix pinctrl-single pcs_pin_dbg_show()
On 3/22/2021 7:56 AM, Drew Fustini wrote: I'm curious what SoC are you using? I'm working on Amazon Annapurna Labs SoCs (based on ARM cortex processors). That include multiple pins controlled with same register. It's good to know who has hardware to test bits_per_mux in the future. I pay attention to pinctrl-single as that is the driver used for the TI AM3358 SoC used in a variety of BeagleBone boards. It does not use bits_per_mux, but I can verify that this does not cause any regression for the AM3358 SoC: /sys/kernel/debug/pinctrl/44e10800.pinmux-pinctrl-single# cat pins registered pins: 142 pin 0 (PIN0) 0:? 44e10800 0027 pinctrl-single pin 1 (PIN1) 0:? 44e10804 0027 pinctrl-single pin 2 (PIN2) 0:? 44e10808 0027 pinctrl-single pin 3 (PIN3) 0:? 44e1080c 0027 pinctrl-single pin 4 (PIN4) 0:? 44e10810 0027 pinctrl-single pin 5 (PIN5) 0:? 44e10814 0027 pinctrl-single pin 6 (PIN6) 0:? 44e10818 0027 pinctrl-single pin 7 (PIN7) 0:? 44e1081c 0027 pinctrl-single pin 8 (PIN8) 22:gpio-96-127 44e10820 0027 pinctrl-single pin 9 (PIN9) 23:gpio-96-127 44e10824 0037 pinctrl-single pin 10 (PIN10) 26:gpio-96-127 44e10828 0037 pinctrl-single pin 11 (PIN11) 27:gpio-96-127 44e1082c 0037 pinctrl-single pin 12 (PIN12) 0:? 44e10830 0037 pinctrl-single pin 140 (PIN140) 0:? 44e10a30 0028 pinctrl-single pin 141 (PIN141) 13:gpio-64-95 44e10a34 0020 pinctrl-single Reviewed-by: Drew Fustini Thanks for review and verify the change. Thanks, Hanna Thanks, Drew
Re: [PATCH v3 3/3] pinctrl: pinctrl-single: fix pcs_pin_dbg_show() when bits_per_mux is not zero
On 3/18/2021 2:15 PM, Andy Shevchenko wrote: On Wed, Mar 17, 2021 at 11:42 PM Hanna Hawa wrote: An SError was detected when trying to print the supported pins in a What is SError? Yes, I have read a discussion, but here is the hint: if a person sees this as a first text due to, for example, bisecting an issue, what she/he can get from this cryptic name? What you suggest? s/An SError/A kernel-panic/? Or remove the sentence and keep the below: " This change fixes the pcs_pin_dbg_show() in pinctrl-single driver when bits_per_mux is not zero. In addition move offset calculation and pin offset in register to common function. " pinctrl device which supports multiple pins per register. This change fixes the pcs_pin_dbg_show() in pinctrl-single driver when bits_per_mux is not zero. In addition move offset calculation and pin offset in register to common function. Reviewed-by: Andy Shevchenko Thanks Fixes: 4e7e8017a80e ("pinctrl: pinctrl-single: enhance to configure multiple pins of different modules") Signed-off-by: Hanna Hawa --- drivers/pinctrl/pinctrl-single.c | 57 +--- 1 file changed, 37 insertions(+), 20 deletions(-) diff --git a/drivers/pinctrl/pinctrl-single.c b/drivers/pinctrl/pinctrl-single.c index f3394517cb2e..4595acf6545e 100644 --- a/drivers/pinctrl/pinctrl-single.c +++ b/drivers/pinctrl/pinctrl-single.c @@ -270,20 +270,46 @@ static void __maybe_unused pcs_writel(unsigned val, void __iomem *reg) writel(val, reg); } +static unsigned int pcs_pin_reg_offset_get(struct pcs_device *pcs, + unsigned int pin) +{ + unsigned int mux_bytes; + + mux_bytes = pcs->width / BITS_PER_BYTE; Can be folded to one line. Ack + if (pcs->bits_per_mux) { + unsigned int pin_offset_bytes; + + pin_offset_bytes = (pcs->bits_per_pin * pin) / BITS_PER_BYTE; + return (pin_offset_bytes / mux_bytes) * mux_bytes; Side note for the further improvements (in a separate change, because I see that you just copied an original code, and after all this is just a fix patch): this can be replaced by round down APIs (one which works for arbitrary divisors). Agree, didn't want to change the formula as it's fix patch. (here and below), this can be taken for further improvements. + } + + return pin * mux_bytes; +} + +static unsigned int pcs_pin_shift_reg_get(struct pcs_device *pcs, + unsigned int pin) +{ + return (pin % (pcs->width / pcs->bits_per_pin)) * pcs->bits_per_pin; Also a side note: I'm wondering if this can be optimized to have less divisions. +} + static void pcs_pin_dbg_show(struct pinctrl_dev *pctldev, struct seq_file *s, unsigned pin) { Thanks, Hanna
Re: [PATCH v2 3/3] pinctrl: pinctrl-single: fix pcs_pin_dbg_show() when bits_per_mux != 0
On 3/17/2021 2:27 PM, Andy Shevchenko wrote: On Tue, Mar 16, 2021 at 11:24 PM Hanna Hawa wrote: An SError was detected when trying to print the supported pins in a What SError is? System error: [ 24.257831] SError Interrupt on CPU0, code 0xbf02 -- SError ... [ 24.257855] Kernel panic - not syncing: Asynchronous SError Interrupt pinctrl device which supports multiple pins per register. This change fixes the pcs_pin_dbg_show() in pinctrl-single driver when bits_per_mux != 0. In addition move offset calculation and pin offset in '!= 0' -> 'is not zero' Will be fixed. register to common function. Fixes tag? Sure, will be added. Signed-off-by: Hanna Hawa --- drivers/pinctrl/pinctrl-single.c | 66 ++-- 1 file changed, 45 insertions(+), 21 deletions(-) diff --git a/drivers/pinctrl/pinctrl-single.c b/drivers/pinctrl/pinctrl-single.c index f3394517cb2e..434f90c8b1b3 100644 --- a/drivers/pinctrl/pinctrl-single.c +++ b/drivers/pinctrl/pinctrl-single.c @@ -270,20 +270,53 @@ static void __maybe_unused pcs_writel(unsigned val, void __iomem *reg) writel(val, reg); } +static unsigned int pcs_pin_reg_offset_get(struct pcs_device *pcs, + unsigned int pin) +{ + unsigned int offset, mux_bytes; Offset can be replaced by direct return statements. Ack. + mux_bytes = pcs->width / BITS_PER_BYTE; + + if (pcs->bits_per_mux) { + unsigned int pin_offset_bytes; + + pin_offset_bytes = (pcs->bits_per_pin * pin) / BITS_PER_BYTE; + offset = (pin_offset_bytes / mux_bytes) * mux_bytes; + } else { + offset = pin * mux_bytes; + } + + return offset; +} + +static unsigned int pcs_pin_shift_reg_get(struct pcs_device *pcs, + unsigned int pin) +{ + return ((pin % (pcs->width / pcs->bits_per_pin)) * pcs->bits_per_pin); Too many parentheses. Will remove the extra parentheses. +} + static void pcs_pin_dbg_show(struct pinctrl_dev *pctldev, struct seq_file *s, unsigned pin) { struct pcs_device *pcs; - unsigned val, mux_bytes; + unsigned int val; unsigned long offset; size_t pa; pcs = pinctrl_dev_get_drvdata(pctldev); - mux_bytes = pcs->width / BITS_PER_BYTE; - offset = pin * mux_bytes; - val = pcs->read(pcs->base + offset); + offset = pcs_pin_reg_offset_get(pcs, pin); + + if (pcs->bits_per_mux) { + unsigned int pin_shift_in_reg = pcs_pin_shift_reg_get(pcs, pin); + val = pcs->read(pcs->base + offset) + & (pcs->fmask << pin_shift_in_reg); One line? At least move & to the upper line. + } else { + val = pcs->read(pcs->base + offset); It's the same as in above branch, why not val = read(); if () val &= fmask << _reg_get(...); ? Agree, looks better Thanks Andy, will send new patchset soon. Thanks, Hanna
Re: [PATCH v5 0/6] Amazon's Annapurna Labs Alpine v3 device-tree
On 7/23/2020 1:10 PM, Arnd Bergmann wrote: Hi Hanna and Antoine, I just came across this old series and noticed we had never merged it. I don't know if the patches all still apply. Could you check and perhaps resend to...@kernel.org if they are still good to go into the coming merge window? Hi Arnd, Sure, will rebase the patches and send ASAP. Thanks, Hanna
Re: [PATCH v9 3/3] edac: Add support for Amazon's Annapurna Labs L2 EDAC
On 3/10/2020 3:47 PM, Robert Richter wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. On 29.01.20 21:50:16, Hanna Hawa wrote: Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and report L2 errors. Signed-off-by: Hanna Hawa --- MAINTAINERS | 5 + drivers/edac/Kconfig | 8 ++ drivers/edac/Makefile | 1 + drivers/edac/al_l2_edac.c | 270 ++ 4 files changed, 284 insertions(+) create mode 100644 drivers/edac/al_l2_edac.c Hanna, most of the review comments by Boris for patch #1 (al_l1_edac) apply here too. Please address them. Hi Boris, Robert, Sorry for not getting back to you sooner, will address comments on both files and upload new patchset ASAP. Thanks, Hanna Thanks, -Robert
Re: [PATCH v6 3/3] edac: Add support for Amazon's Annapurna Labs L2 EDAC
On 10/10/2019 2:09 AM, Rob Herring wrote: +Sudeep On Mon, Oct 7, 2019 at 10:18 AM Hanna Hawa wrote: Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and report L2 errors. I was curious why you needed a DT cache parsing function... [...] +static int al_l2_edac_probe(struct platform_device *pdev) +{ + struct edac_device_ctl_info *edac_dev; + struct al_l2_edac *al_l2; + struct device *dev = >dev; + int ret, i; + + edac_dev = edac_device_alloc_ctl_info(sizeof(*al_l2), DRV_NAME, 1, "L", + 1, 2, NULL, 0, + edac_device_alloc_index()); + if (!edac_dev) + return -ENOMEM; + + al_l2 = edac_dev->pvt_info; + edac_dev->edac_check = al_l2_edac_check; + edac_dev->dev = dev; + edac_dev->mod_name = DRV_NAME; + edac_dev->dev_name = dev_name(dev); + edac_dev->ctl_name = "L2_cache"; + platform_set_drvdata(pdev, edac_dev); + + INIT_LIST_HEAD(_l2->l2_caches); + + for_each_possible_cpu(i) { + struct device_node *cpu; + struct device_node *cpu_cache; + struct al_l2_cache *l2_cache; + bool found = false; + + cpu = of_get_cpu_node(i, NULL); + if (!cpu) + continue; + + cpu_cache = of_find_next_cache_node(cpu); + list_for_each_entry(l2_cache, _l2->l2_caches, list_node) { + if (l2_cache->of_node == cpu_cache) { + found = true; + break; + } + } + + if (found) { + cpumask_set_cpu(i, _cache->cluster_cpus); + } else { + l2_cache = devm_kzalloc(dev, sizeof(*l2_cache), + GFP_KERNEL); + l2_cache->of_node = cpu_cache; + list_add(_cache->list_node, _l2->l2_caches); + cpumask_set_cpu(i, _cache->cluster_cpus); + } + + of_node_put(cpu); + } We already have what's probably similar code to parse DT and populate cacheinfo data. Does that not work for you? If not, why not and can we extend it? As I saw in cacheinfo it will return the cacheinfo for the online CPUs only, correct me if I'm wrong.. Here I'm parsing all the L2 info for all CPUs depends on DT to get "cluster_cpus", and using smp_call_function_any() will call the online cpu to read the L2MERRSR status register. Then your driver might work if the data comes from ACPI instead (or maybe that's all different, I don't know). No plan to get it work on ACPI, at least in the near future. Rob
Re: [PATCH v6 3/3] edac: Add support for Amazon's Annapurna Labs L2 EDAC
Hi, On 10/10/2019 2:19 AM, Rob Herring wrote: On Mon, Oct 7, 2019 at 10:18 AM Hanna Hawa wrote: Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and report L2 errors. Signed-off-by: Hanna Hawa --- MAINTAINERS | 5 + drivers/edac/Kconfig | 8 ++ drivers/edac/Makefile | 1 + drivers/edac/al_l2_edac.c | 251 ++ 4 files changed, 265 insertions(+) create mode 100644 drivers/edac/al_l2_edac.c diff --git a/MAINTAINERS b/MAINTAINERS index 7887a62dc843..0eabcfcf91a9 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -748,6 +748,11 @@ M: Hanna Hawa S: Maintained F: drivers/edac/al_l1_edac.c +AMAZON ANNAPURNA LABS L2 EDAC +M: Hanna Hawa +S: Maintained +F: drivers/edac/al_l2_edac.c + AMAZON ANNAPURNA LABS THERMAL MMIO DRIVER M: Talel Shenhar S: Maintained diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index e8161d7c7469..cb394aff1cab 100644 --- a/drivers/edac/Kconfig +++ b/drivers/edac/Kconfig @@ -82,6 +82,14 @@ config EDAC_AL_L1 for Amazon's Annapurna Labs SoCs. This driver detects errors of L1 caches. +config EDAC_AL_L2 + tristate "Amazon's Annapurna Labs L2 EDAC" I still think this should be an "A57 L2 ECC" driver, but if no one cares I'll shut up and the 2nd person can rename everything. + depends on ARCH_ALPINE || COMPILE_TEST Will be add in next patchset. Maybe it needs an ARM64 dependency too in this case? Yes, it need ARM64 dependency, I'll add. Thanks, Hanna + help + Support for L2 error detection and correction + for Amazon's Annapurna Labs SoCs. + This driver detects errors of L2 caches. + + + ret = platform_driver_register(_l2_edac_driver); + if (ret) { + pr_err("Failed to register %s (%d)\n", DRV_NAME, ret); + return ret; + } + + edac_l2_device = platform_device_register_simple(DRV_NAME, -1, NULL, 0); + if (IS_ERR(edac_l2_device)) { + pr_err("Failed to register EDAC AL L2 platform device\n"); + return PTR_ERR(edac_l2_device); + } + + return 0; +} + +static void __exit al_l2_exit(void) +{ + platform_device_unregister(edac_l2_device); + platform_driver_unregister(_l2_edac_driver); +} + +late_initcall(al_l2_init); +module_exit(al_l2_exit); + +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Hanna Hawa "); +MODULE_DESCRIPTION("Amazon's Annapurna Lab's L2 EDAC Driver"); -- 2.17.1
Re: [PATCH v4 1/2] edac: Add an API for edac device to report for multiple errors
On 9/30/2019 5:50 PM, Borislav Petkov wrote: On Mon, Sep 23, 2019 at 08:17:40PM +0100, Hanna Hawa wrote: +void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, + int inst_nr, int block_nr, const char *msg) +{ + __edac_device_handle_ce(edac_dev, 1, inst_nr, block_nr, msg); +} +EXPORT_SYMBOL_GPL(edac_device_handle_ce); Eww, I don't like that: exporting the function *and* the __ counterpart. The user will get confused and that is unnecessary. See below for a better version. This way you solve the whole deal with a single patch. I'm okay with this version, minor comment below. --- From: Hanna Hawa Date: Mon, 23 Sep 2019 20:17:40 +0100 Subject: [PATCH] EDAC/device: Rework error logging API Make the main workhorse the "count" functions which can log a @count of errors. Have the current APIs edac_device_handle_{ce,ue}() call the _count() variants and this way keep the exported symbols number unchanged. [ bp: Rewrite. ] Signed-off-by: Hanna Hawa Signed-off-by: Borislav Petkov Cc: b...@amazon.com Cc: d...@amazon.co.uk Cc: hano...@amazon.com Cc: James Morse Cc: jon...@amazon.com Cc: linux-edac Cc: Mauro Carvalho Chehab Cc: ron...@amazon.com Cc: ta...@amazon.com Cc: Tony Luck Link: https://lkml.kernel.org/r/20190923191741.29322-2-hhh...@amazon.com --- drivers/edac/edac_device.c | 44 --- drivers/edac/edac_device.h | 54 ++ 2 files changed, 66 insertions(+), 32 deletions(-) diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c index 65cf2b9355c4..d4d8bed5b55d 100644 --- a/drivers/edac/edac_device.c +++ b/drivers/edac/edac_device.c @@ -555,8 +555,9 @@ static inline int edac_device_get_panic_on_ue(struct edac_device_ctl_info return edac_dev->panic_on_ue; } -void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, - int inst_nr, int block_nr, const char *msg) +void edac_device_handle_ce_count(struct edac_device_ctl_info *edac_dev, +unsigned int count, int inst_nr, int block_nr, +const char *msg) { struct edac_device_instance *instance; struct edac_device_block *block = NULL; Missing count check, same in *_ue_count(): if (count) return; @@ -582,23 +583,24 @@ void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, if (instance->nr_blocks > 0) { block = instance->blocks + block_nr; - block->counters.ce_count++; + block->counters.ce_count += count; } /* Propagate the count up the 'totals' tree */ - instance->counters.ce_count++; - edac_dev->counters.ce_count++; + instance->counters.ce_count += count; + edac_dev->counters.ce_count += count; if (edac_device_get_log_ce(edac_dev)) edac_device_printk(edac_dev, KERN_WARNING, - "CE: %s instance: %s block: %s '%s'\n", - edac_dev->ctl_name, instance->name, - block ? block->name : "N/A", msg); + "CE: %s instance: %s block: %s count: %d '%s'\n", + edac_dev->ctl_name, instance->name, + block ? block->name : "N/A", count, msg); } -EXPORT_SYMBOL_GPL(edac_device_handle_ce); +EXPORT_SYMBOL_GPL(edac_device_handle_ce_count); -void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev, - int inst_nr, int block_nr, const char *msg) +void edac_device_handle_ue_count(struct edac_device_ctl_info *edac_dev, +unsigned int count, int inst_nr, int block_nr, +const char *msg) { struct edac_device_instance *instance; struct edac_device_block *block = NULL; @@ -624,22 +626,22 @@ void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev, if (instance->nr_blocks > 0) { block = instance->blocks + block_nr; - block->counters.ue_count++; + block->counters.ue_count += count; } /* Propagate the count up the 'totals' tree */ - instance->counters.ue_count++; - edac_dev->counters.ue_count++; + instance->counters.ue_count += count; + edac_dev->counters.ue_count += count; if (edac_device_get_log_ue(edac_dev)) edac_device_printk(edac_dev, KERN_EMERG, - "UE: %s instance: %s block: %s '%s'\n", - edac_dev->ctl_name, instance->name, - block ? block->name : "N/A", msg); + "UE: %s instance: %s block: %s count: %d '%s'\n", + edac_dev->ctl_name, instance->name, + block ? block->name : "N/A", count, msg); if
Re: [PATCH v3 1/2] edac: Add an API for edac device to report for multiple errors
On 9/20/2019 9:42 AM, Robert Richter wrote: On 19.09.19 18:17:12, Hanna Hawa wrote: Add an API for EDAC device to report multiple errors with same type. Signed-off-by: Hanna Hawa With the change below it looks good to me: Acked-by: Robert Richter Thanks Thanks, -Robert --- drivers/edac/edac_device.c | 62 ++ drivers/edac/edac_device.h | 40 2 files changed, 82 insertions(+), 20 deletions(-) diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c index 65cf2b9355c4..866934f2bcb0 100644 --- a/drivers/edac/edac_device.c +++ b/drivers/edac/edac_device.c @@ -555,12 +555,16 @@ static inline int edac_device_get_panic_on_ue(struct edac_device_ctl_info return edac_dev->panic_on_ue; } -void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, - int inst_nr, int block_nr, const char *msg) +void __edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, +unsigned int count, int inst_nr, int block_nr, +const char *msg) { struct edac_device_instance *instance; struct edac_device_block *block = NULL; + if (!count) + return; + Those checks should be moved to the *_count() variants of both functions. Will be moved to the inline functions. [...] +static inline void edac_device_handle_ce_count(struct edac_device_ctl_info *edac_dev, + unsigned int count, int inst_nr, + int block_nr, const char *msg) +{ if (count) ... + __edac_device_handle_ce(edac_dev, count, inst_nr, block_nr, msg); +} + +static inline void edac_device_handle_ue_count(struct edac_device_ctl_info *edac_dev, + unsigned int count, int inst_nr, + int block_nr, const char *msg) +{ Here too. + __edac_device_handle_ue(edac_dev, count, inst_nr, block_nr, msg); +} Thanks, Hanna
Re: [PATCH v3 2/2] edac: move edac_device_handle_*() API functions to header
On 9/20/2019 9:49 AM, Robert Richter wrote: On 19.09.19 18:17:13, Hanna Hawa wrote: Move edac_device_handle_*() functions from source file to header file as inline funtcion that use the new API with single error. Signed-off-by: Hanna Hawa With the changes below it looks good to me: Acked-by: Robert Richter Thanks, -Robert diff --git a/drivers/edac/edac_device.h b/drivers/edac/edac_device.h index 30dc5f5979c8..796ea134a691 100644 --- a/drivers/edac/edac_device.h +++ b/drivers/edac/edac_device.h @@ -285,29 +285,6 @@ extern int edac_device_add_device(struct edac_device_ctl_info *edac_dev); */ extern struct edac_device_ctl_info *edac_device_del_device(struct device *dev); -/** - * edac_device_handle_ue(): - * perform a common output and handling of an 'edac_dev' UE event - * - * @edac_dev: pointer to struct _device_ctl_info - * @inst_nr: number of the instance where the UE error happened - * @block_nr: number of the block where the UE error happened - * @msg: message to be printed - */ -extern void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev, - int inst_nr, int block_nr, const char *msg); -/** - * edac_device_handle_ce(): - * perform a common output and handling of an 'edac_dev' CE event - * - * @edac_dev: pointer to struct _device_ctl_info - * @inst_nr: number of the instance where the CE error happened - * @block_nr: number of the block where the CE error happened - * @msg: message to be printed - */ -extern void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, - int inst_nr, int block_nr, const char *msg); - Just put in the inline replacement here. I'll re-arrange the functions in patches 1/2 and put the *edac_device_handle_* functions here instead of end of file. /** * edac_device_alloc_index: Allocate a unique device index number * @@ -357,4 +334,18 @@ static inline void edac_device_handle_ue_count(struct edac_device_ctl_info *edac { __edac_device_handle_ue(edac_dev, count, inst_nr, block_nr, msg); } + +static inline void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, +int inst_nr, int block_nr, No need for this linebreak. It'll be more than 80 characters. +const char *msg) +{ + __edac_device_handle_ce(edac_dev, 1, inst_nr, block_nr, msg); +} + +static inline void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev, +int inst_nr, int block_nr, Same here. +const char *msg) +{ + __edac_device_handle_ue(edac_dev, 1, inst_nr, block_nr, msg); +} #endif -- 2.17.1
Re: [PATCH v2 1/2] edac: Add an API for edac device to report for multiple errors
On 9/19/2019 9:33 AM, Robert Richter wrote: On 12.09.19 15:53:04, Hanna Hawa wrote: Add an API for EDAC device to report multiple errors with same type. Signed-off-by: Hanna Hawa --- drivers/edac/edac_device.c | 91 ++ drivers/edac/edac_device.h | 40 + 2 files changed, 131 insertions(+) diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c index 65cf2b9355c4..78ac44103acc 100644 --- a/drivers/edac/edac_device.c +++ b/drivers/edac/edac_device.c @@ -643,3 +643,94 @@ void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev, block ? block->name : "N/A", msg); } EXPORT_SYMBOL_GPL(edac_device_handle_ue); + +void __edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, +unsigned int count, int inst_nr, int block_nr, +const char *msg) +{ Please do not add a copy here, instead modify the existing function and share the code with both, old and new functions. Will be fixed. Thanks, -Robert
Re: [PATCH 1/1] edac: Add an API for edac device to report for multiple errors
Hi On 9/8/2019 11:35 AM, Borislav Petkov wrote: On Sun, Sep 08, 2019 at 10:16:02AM +0200, Borislav Petkov wrote: On Sun, Sep 08, 2019 at 10:58:31AM +0300, Hawa, Hanna wrote: Better use WARN_ON_ONCE() to avoid flooding. In case of two drivers using this function with wrong error count, only the first WARN_ON_ONCE will catch in this case, and other will miss other wrong usage of other edac device drivers. The idea is to catch any driver using a 0 error count and fix it, not to flood dmesg. You want _ONCE. ... and you want to return early too, i.e., if (WARN_ON_ONCE(!error_count)) return; Frankly, I'd even remove all the warning functionality and simply do if (!error_count) return; I'll keep it simple as you suggest and remove the warning functionality. but let's see how much it screams first. Thanks, Hanna
Re: [PATCH 1/1] edac: Add an API for edac device to report for multiple errors
On 9/5/2019 12:56 PM, Robert Richter wrote: Hi Hanna, thanks for the update. See below. On 05.09.19 09:37:45, Hanna Hawa wrote: Add an API for edac device to report multiple errors with same type. Signed-off-by: Hanna Hawa --- drivers/edac/edac_device.c | 66 +- drivers/edac/edac_device.h | 31 -- 2 files changed, 79 insertions(+), 18 deletions(-) diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c index 65cf2b9355c4..bf6a4fd9831b 100644 --- a/drivers/edac/edac_device.c +++ b/drivers/edac/edac_device.c @@ -555,12 +555,15 @@ static inline int edac_device_get_panic_on_ue(struct edac_device_ctl_info return edac_dev->panic_on_ue; } -void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, - int inst_nr, int block_nr, const char *msg) +static void __edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, + u16 error_count, int inst_nr, int block_nr, Just curious, why u16, some register mask size? Maybe just use unsigned int? Wanted to be aligned with edac MC. I can change it to be u32. I think the variable can be shortened to 'count', the meaning should still be clear. I think more clear to include 'error'. maybe shorter name 'err_count'? + const char *msg) { struct edac_device_instance *instance; struct edac_device_block *block = NULL; + WARN_ON(!error_count); Should return in this case. Better use WARN_ON_ONCE() to avoid flooding. In case of two drivers using this function with wrong error count, only the first WARN_ON_ONCE will catch in this case, and other will miss other wrong usage of other edac device drivers. The check should be moved to edac_device_handle_ce_count(). I'll move it to edac_device_handle_ce_count. + if ((inst_nr >= edac_dev->nr_instances) || (inst_nr < 0)) { edac_device_printk(edac_dev, KERN_ERR, "INTERNAL ERROR: 'instance' out of range " @@ -582,27 +585,44 @@ void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, if (instance->nr_blocks > 0) { block = instance->blocks + block_nr; - block->counters.ce_count++; + block->counters.ce_count += error_count; } /* Propagate the count up the 'totals' tree */ - instance->counters.ce_count++; - edac_dev->counters.ce_count++; + instance->counters.ce_count += error_count; + edac_dev->counters.ce_count += error_count; if (edac_device_get_log_ce(edac_dev)) edac_device_printk(edac_dev, KERN_WARNING, - "CE: %s instance: %s block: %s '%s'\n", + "CE: %s instance: %s block: %s count: %d '%s'\n", edac_dev->ctl_name, instance->name, - block ? block->name : "N/A", msg); + block ? block->name : "N/A", error_count, msg); +} + +void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, + int inst_nr, int block_nr, const char *msg) +{ + __edac_device_handle_ce(edac_dev, 1, inst_nr, block_nr, msg); } EXPORT_SYMBOL_GPL(edac_device_handle_ce); We could just export the __*() version of those functions and make everything else inline in the header file? Though, better do this with two patches to avoid an ABI breakage in case someone wants to backport it. Let's see what others say here. Waiting for other reviewers. -void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev, - int inst_nr, int block_nr, const char *msg) +void edac_device_handle_ce_count(struct edac_device_ctl_info *edac_dev, +u16 error_count, int inst_nr, int block_nr, +const char *msg) +{ + __edac_device_handle_ce(edac_dev, error_count, inst_nr, block_nr, msg); +} +EXPORT_SYMBOL_GPL(edac_device_handle_ce_count); + +static void __edac_device_handle_ue(struct edac_device_ctl_info *edac_dev, + u16 error_count, int inst_nr, int block_nr, + const char *msg) All the above applies for this function too. { struct edac_device_instance *instance; struct edac_device_block *block = NULL; + WARN_ON(!error_count); + if ((inst_nr >= edac_dev->nr_instances) || (inst_nr < 0)) { edac_device_printk(edac_dev, KERN_ERR, "INTERNAL ERROR: 'instance' out of range " @@ -624,22 +644,36 @@ void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev, if (instance->nr_blocks > 0) { block = instance->blocks + block_nr; - block->counters.ue_count++; + block->counters.ue_count += error_count; } /* Propagate the count up the
Re: [PATCH v3 4/4] edac: Add support for Amazon's Annapurna Labs L2 EDAC
On 9/3/2019 10:27 AM, Robert Richter wrote: On 15.07.19 16:24:09, Hanna Hawa wrote: Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and report L2 errors. Signed-off-by: Hanna Hawa --- MAINTAINERS | 6 ++ drivers/edac/Kconfig | 8 ++ drivers/edac/Makefile | 1 + drivers/edac/al_l2_edac.c | 187 ++ 4 files changed, 202 insertions(+) create mode 100644 drivers/edac/al_l2_edac.c From a brief look at it, it seems some of my comments from 2/4 apply here too. Please look through it. Thanks for your review, will look and fix on top of v5. Thanks, Hanna -Robert
Re: [PATCH v3 2/4] edac: Add support for Amazon's Annapurna Labs L1 EDAC
On 9/3/2019 10:24 AM, Robert Richter wrote: On 15.07.19 16:24:07, Hanna Hawa wrote: Adds support for Amazon's Annapurna Labs L1 EDAC driver to detect and report L1 errors. Signed-off-by: Hanna Hawa Reviewed-by: James Morse --- MAINTAINERS | 6 ++ drivers/edac/Kconfig | 8 +++ drivers/edac/Makefile | 1 + drivers/edac/al_l1_edac.c | 156 ++ 4 files changed, 171 insertions(+) create mode 100644 drivers/edac/al_l1_edac.c diff --git a/drivers/edac/al_l1_edac.c b/drivers/edac/al_l1_edac.c new file mode 100644 index 000..70510ea --- /dev/null +++ b/drivers/edac/al_l1_edac.c [...] +static void al_l1_edac_cpumerrsr(void *arg) Could this being named to something meaningful, such as *_read_status() or so? +{ + struct edac_device_ctl_info *edac_dev = arg; + int cpu, i; + u32 ramid, repeat, other, fatal; + u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1); + char msg[AL_L1_EDAC_MSG_MAX]; + int space, count; + char *p; + + if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val))) + return; [...] +static void al_l1_edac_check(struct edac_device_ctl_info *edac_dev) +{ + on_each_cpu(al_l1_edac_cpumerrsr, edac_dev, 1); +} + +static int al_l1_edac_probe(struct platform_device *pdev) +{ + struct edac_device_ctl_info *edac_dev; + struct device *dev = >dev; + int ret; + + edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L", This type cast looks broken. dev_name() is a constant string already. Other drivers do not use the dynamically generated dev_name() string here, instead a fix string such as mod_name or ctl_name could be used. edac_device_alloc_ctl_info() later generates a unique instance name derived from name + index. Ack, will fix and use DRV_NAME. Regarding the type, this seems to be an API issue of edac_device_ alloc_ctl_info() that should actually use const char* in its interface. So if needed (from what I wrote above it is not) the type in the argument list needs to be changed instead. + 1, 1, NULL, 0, + edac_device_alloc_index()); + if (IS_ERR(edac_dev)) + return -ENOMEM; Use the original error code instead. Actually it return NULL in case of failure, it was changed in v5 to check if error/NULL. + + edac_dev->edac_check = al_l1_edac_check; + edac_dev->dev = dev; + edac_dev->mod_name = DRV_NAME; + edac_dev->dev_name = dev_name(dev); + edac_dev->ctl_name = "L1 cache"; Should not contain spaces and maybe a bit more specific. L1_cache_ecc_err? or L1_cache_err? + platform_set_drvdata(pdev, edac_dev); + + ret = edac_device_add_device(edac_dev); + if (ret) { + dev_err(dev, "Failed to add L1 edac device\n"); Move this printk below to the error path and maybe print the error code. You do not cover the -ENOMEM failure. Ack. -Robert + goto err; + } + + return 0; +err: + edac_device_free_ctl_info(edac_dev); + + return ret; +}
Re: [PATCH v5 1/4] dt-bindings: EDAC: Add Amazon's Annapurna Labs L1 EDAC
On 8/21/2019 10:17 PM, Rob Herring wrote: Why is this even in DT? AFAICT, this is all just CortexA57 core features (i.e. nothing Amazon specific). The core type and the ECC capabilities are discoverable. Added to the DT in order to easily enable/disable the driver. You are correct that they are CortexA57 core features and nothing Amazon specific, but it's IMPLEMENTATION DEFINED, meaning that in different cortex revisions (e.g. A57) the register bitmap may change. Because of that we added an Amazon compatible which corresponds to the specific core we are using. Thanks, Hanna Rob
Re: [PATCH v4 4/4] edac: Add support for Amazon's Annapurna Labs L2 EDAC
On 8/2/2019 6:11 PM, James Morse wrote: Hi Hanna, On 01/08/2019 14:09, Hanna Hawa wrote: Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and report L2 errors. diff --git a/drivers/edac/al_l2_edac.c b/drivers/edac/al_l2_edac.c new file mode 100644 index ..6c6d37cf82ab --- /dev/null +++ b/drivers/edac/al_l2_edac.c @@ -0,0 +1,189 @@ +#include +#include #include ? Will be added. +#include +#include [...] +static void al_l2_edac_l2merrsr(void *arg) +{ + struct edac_device_ctl_info *edac_dev = arg; + int cpu, i; + u32 ramid, repeat, other, fatal; + u64 val = read_sysreg_s(ARM_CA57_L2MERRSR_EL1); + char msg[AL_L2_EDAC_MSG_MAX]; + int space, count; + char *p; + + if (!(FIELD_GET(ARM_CA57_L2MERRSR_VALID, val))) + return; + + write_sysreg_s(0, ARM_CA57_L2MERRSR_EL1); + + cpu = smp_processor_id(); + ramid = FIELD_GET(ARM_CA57_L2MERRSR_RAMID, val); + repeat = FIELD_GET(ARM_CA57_L2MERRSR_REPEAT, val); + other = FIELD_GET(ARM_CA57_L2MERRSR_OTHER, val); + fatal = FIELD_GET(ARM_CA57_L2MERRSR_FATAL, val); + + space = sizeof(msg); + p = msg; + count = scnprintf(p, space, "CPU%d L2 %serror detected", cpu, + (fatal) ? "Fatal " : ""); + p += count; + space -= count; + + switch (ramid) { + case ARM_CA57_L2_TAG_RAM: + count = scnprintf(p, space, " RAMID='L2 Tag RAM'"); + break; + case ARM_CA57_L2_DATA_RAM: + count = scnprintf(p, space, " RAMID='L2 Data RAM'"); + break; + case ARM_CA57_L2_SNOOP_RAM: + count = scnprintf(p, space, " RAMID='L2 Snoop RAM'"); Nit: The TRMs both call this 'L2 Snoop Tag RAM'. Could we include 'tag' in the description. 'tag' implies its some kind of metadata, so an uncorrected error here affect a now unknown location, its more series than a 'data RAM' error. v8.2 would term this kind of error 'uncontained'. Ack, will be fixed. + break; + case ARM_CA57_L2_DIRTY_RAM: + count = scnprintf(p, space, " RAMID='L2 Dirty RAM'"); + break; + case ARM_CA57_L2_INC_PF_RAM: + count = scnprintf(p, space, " RAMID='L2 internal metadat'"); Nit: metadata Ack, will be fixed. + break; + default: + count = scnprintf(p, space, " RAMID='unknown'"); + break; + } + + p += count; + space -= count; + + count = scnprintf(p, space, + " repeat=%d, other=%d (L2MERRSR_EL1=0x%llx)", + repeat, other, val); + + for (i = 0; i < repeat; i++) { + if (fatal) + edac_device_handle_ue(edac_dev, 0, 0, msg); + else + edac_device_handle_ce(edac_dev, 0, 0, msg); + } +} [...] +static int al_l2_edac_probe(struct platform_device *pdev) +{ + struct edac_device_ctl_info *edac_dev; + struct al_l2_edac *al_l2; + struct device *dev = >dev; + int ret, i; + + edac_dev = edac_device_alloc_ctl_info(sizeof(*al_l2), + (char *)dev_name(dev), 1, "L", 1, + 2, NULL, 0, + edac_device_alloc_index()); + if (IS_ERR_OR_NULL(edac_dev)) + return -ENOMEM; + + al_l2 = edac_dev->pvt_info; + edac_dev->edac_check = al_l2_edac_check; + edac_dev->dev = dev; + edac_dev->mod_name = DRV_NAME; + edac_dev->dev_name = dev_name(dev); + edac_dev->ctl_name = "L2 cache"; + platform_set_drvdata(pdev, edac_dev); + for_each_online_cpu(i) { for_each_possible_cpu()? If you boot with maxcpus= the driver's behaviour changes. But you are only parsing information from the DT, so you don't really need the CPUs to be online. Agree, for dt parsing no need the online CPUs, if for_each_possible_cpu() used then smp_call_function_any() will run only on the online CPUs in the mask. Will change to for_each_possible_cpu(). + struct device_node *cpu; + struct device_node *cpu_cache, *l2_cache; + + cpu = of_get_cpu_node(i, NULL); (of_get_cpu_node() can return NULL, but I don't think it can ever happen like this) + cpu_cache = of_find_next_cache_node(cpu); + l2_cache = of_parse_phandle(dev->of_node, "l2-cache", 0); + + if (cpu_cache == l2_cache) + cpumask_set_cpu(i, _l2->cluster_cpus); You need to of_node_put() these device_node pointers once you're done with them. Will be added. + } + + if (cpumask_empty(_l2->cluster_cpus)) { + dev_err(dev, "CPU mask is empty for this L2 cache\n"); + ret = -EINVAL; + goto err; +
Re: [RFC 1/1] edac: Add a counter parameter for edac_device_handle_ue/ce()
On 8/1/2019 5:17 PM, Robert Richter wrote: Don't you think it'll be confused to have different APIs between EDAC_MC and EDAC_DEVICE? (in MC the count passed as part of edac_mc_handle_error()) I don't think edac_mc_handle_error() with 11 function arguments is a good reference for somethin we want to adopt. For the majority of drivers you just introduce another useless argument with the following pattern: edac_device_handle_ce(edac_dev, 1, 0, 0, edac_dev_name); IMO, the api should be improved when touching it. Got it, I'll update the patch as you suggested. Thanks, Hanna -Robert
Re: [RFC 1/1] edac: Add a counter parameter for edac_device_handle_ue/ce()
On 8/1/2019 2:35 PM, Robert Richter wrote: On 15.07.19 13:53:07, Hanna Hawa wrote: Add a counter parameter in order to avoid losing errors count for edac device, the error count reports the number of errors reported by an edac device similar to the way MC_EDAC do. Signed-off-by: Hanna Hawa --- drivers/edac/altera_edac.c | 20 drivers/edac/amd8111_edac.c | 6 +++--- drivers/edac/cpc925_edac.c | 4 ++-- drivers/edac/edac_device.c | 18 ++ drivers/edac/edac_device.h | 8 ++-- drivers/edac/highbank_l2_edac.c | 4 ++-- drivers/edac/mpc85xx_edac.c | 4 ++-- drivers/edac/mv64x60_edac.c | 4 ++-- drivers/edac/octeon_edac-l2c.c | 20 ++-- drivers/edac/octeon_edac-pc.c | 6 +++--- drivers/edac/qcom_edac.c| 8 drivers/edac/thunderx_edac.c| 10 +- drivers/edac/xgene_edac.c | 26 +- 13 files changed, 74 insertions(+), 64 deletions(-) diff --git a/drivers/edac/edac_device.h b/drivers/edac/edac_device.h index 1aaba74..cf1a1da 100644 --- a/drivers/edac/edac_device.h +++ b/drivers/edac/edac_device.h @@ -290,23 +290,27 @@ extern struct edac_device_ctl_info *edac_device_del_device(struct device *dev); *perform a common output and handling of an 'edac_dev' UE event * * @edac_dev: pointer to struct _device_ctl_info + * @error_count: number of errors of the same type * @inst_nr: number of the instance where the UE error happened * @block_nr: number of the block where the UE error happened * @msg: message to be printed */ extern void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev, - int inst_nr, int block_nr, const char *msg); + u16 error_count, int inst_nr, int block_nr, + const char *msg); /** * edac_device_handle_ce(): *perform a common output and handling of an 'edac_dev' CE event * * @edac_dev: pointer to struct _device_ctl_info + * @error_count: number of errors of the same type * @inst_nr: number of the instance where the CE error happened * @block_nr: number of the block where the CE error happened * @msg: message to be printed */ extern void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev, How about renaming this to __edac_device_handle_ce() and then have 2 macros for: * edac_device_handle_ce() to keep old i/f. * edac_device_handle_ce_count(), with count parameter added. Same for uncorrectable errors. Code of other driver can be kept as it is then. Don't you think it'll be confused to have different APIs between EDAC_MC and EDAC_DEVICE? (in MC the count passed as part of edac_mc_handle_error()) I don't have strong objection, the change for other drivers is not that hard. Thanks, -Robert - int inst_nr, int block_nr, const char *msg); + u16 error_count, int inst_nr, int block_nr, + const char *msg);
Re: [PATCH v3 2/4] edac: Add support for Amazon's Annapurna Labs L1 EDAC
On 7/26/2019 7:49 PM, James Morse wrote: Hi Hanna, On 15/07/2019 14:24, Hanna Hawa wrote: Adds support for Amazon's Annapurna Labs L1 EDAC driver to detect and report L1 errors. diff --git a/drivers/edac/al_l1_edac.c b/drivers/edac/al_l1_edac.c new file mode 100644 index 000..70510ea --- /dev/null +++ b/drivers/edac/al_l1_edac.c @@ -0,0 +1,156 @@ +#include You need for on-each_cpu(). +#include "edac_device.h" +#include "edac_module.h" You need for the sys_reg() macro. The ARCH_ALPINE dependency doesn't stop this from being built on 32bit arm, where this sys_reg() won't work/exist. Will fix. [...] +static void al_l1_edac_cpumerrsr(void *arg) +{ + struct edac_device_ctl_info *edac_dev = arg; + int cpu, i; + u32 ramid, repeat, other, fatal; + u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1); + char msg[AL_L1_EDAC_MSG_MAX]; + int space, count; + char *p; + if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val))) + return; + space = sizeof(msg); + p = msg; + count = snprintf(p, space, "CPU%d L1 %serror detected", cpu, +(fatal) ? "Fatal " : ""); + p += count; + space -= count; snprintf() will return the number of characters it would have generated, even if that is more than space. If this happen, space becomes negative, p points outside msg[] and msg[] isn't NULL terminated... It looks like you want scnprintf(), which returns the number of characters written to buf instead. (I don't see how 256 characters would be printed by this code) Will use scnprintf() instead. + switch (ramid) { + case ARM_CA57_L1_I_TAG_RAM: + count = snprintf(p, space, " RAMID='L1-I Tag RAM'"); + break; + case ARM_CA57_L1_I_DATA_RAM: + count = snprintf(p, space, " RAMID='L1-I Data RAM'"); + break; + case ARM_CA57_L1_D_TAG_RAM: + count = snprintf(p, space, " RAMID='L1-D Tag RAM'"); + break; + case ARM_CA57_L1_D_DATA_RAM: + count = snprintf(p, space, " RAMID='L1-D Data RAM'"); + break; + case ARM_CA57_L2_TLB_RAM: + count = snprintf(p, space, " RAMID='L2 TLB RAM'"); + break; + default: + count = snprintf(p, space, " RAMID='unknown'"); + break; + } + + p += count; + space -= count; + count = snprintf(p, space, +" repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)", +repeat, other, val); + + for (i = 0; i < repeat; i++) { + if (fatal) + edac_device_handle_ue(edac_dev, 0, 0, msg); + else + edac_device_handle_ce(edac_dev, 0, 0, msg); + } + + write_sysreg_s(0, ARM_CA57_CPUMERRSR_EL1); Writing 0 just after you've read the value would minimise the window where repeat could have increased behind your back, or another error was counted as other, when it could have been reported more accurately. Good point, I will move the write after checking the valid bit. +} +static int al_l1_edac_probe(struct platform_device *pdev) +{ + struct edac_device_ctl_info *edac_dev; + struct device *dev = >dev; + int ret; + + edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L", + 1, 1, NULL, 0, + edac_device_alloc_index()); + if (IS_ERR(edac_dev)) edac_device_alloc_ctl_info() returns NULL, or dev_ctl, which comes from kzalloc(). I think you need to check for NULL here, IS_ERR() only catches the -errno range. (there is an IS_ERR_OR_NULL() if you really needed both) Will fix. + return -ENOMEM; With the header-includes and edac_device_alloc_ctl_info() NULL check: Reviewed-by: James Morse Thanks James. Thanks, Hanna Thanks, James
Re: [UNVERIFIED SENDER] Re: [RFC 1/1] edac: Add a counter parameter for edac_device_handle_ue/ce()
On 7/25/2019 9:36 PM, Mauro Carvalho Chehab wrote: /* Propagate the count up the 'totals' tree */ - instance->counters.ue_count++; - edac_dev->counters.ue_count++; + instance->counters.ue_count += error_count; + edac_dev->counters.ue_count += error_count; Patch itself looks a good idea, but maybe it should rise a WARN() if error_count == 0. Good point, shouldn't we use WARN_ONCE here? if the user call edac_device_handle_ue() with error count == 0, it not be change in run-time, only if the error count parameter is calculated somehow, and it'll be the *caller* issue that didn't check the error count. What you think? That applies for both CE and UE error logic. Sure. Thanks, Hanna Thanks, Mauro
Re: [RFC 1/1] edac: Add a counter parameter for edac_device_handle_ue/ce()
Hi Jan, On 7/17/2019 3:06 PM, Jan Glauber wrote: Hi Hanna, I'm probably missing something but this patch looks like while it adds the error_count parameter the passed values all seem to be 1. So is the new parameter used otherwise, maybe in another patch? Yes in another patch. In Amazon L1/L2 edac driver [1], I'm using loop to report on multiple L1 or L2 errors. After this patch I'll remove the loop and pass the errors count. [1]: https://lkml.org/lkml/2019/7/15/349 Thanks, Hanna thanks, Jan
Re: [PATCH v2 2/4] edac: Add support for Amazon's Annapurna Labs L1 EDAC
On 7/9/2019 12:32 PM, Jonathan Cameron wrote: Signed-off-by: Hanna Hawa A quick drive by review as I was feeling curious. Just a couple of trivial queries and observation on the fact it might be useful to add a few devm managed functions to cut down on edac driver boilerplate. Thanks, Jonathan +#define ARM_CA57_CPUMERRSR_VALID GENMASK(31, 31) For a single bit it's common to use BIT(31) rather than GENMASK to make it explicit. Will fix. + edac_dev->mod_name = dev_name(dev); I'd admit I'm not that familiar with edac, but seems odd that a module name field would have the dev_name. Will fix when I got more inputs. + edac_device_free_ctl_info(edac_dev); More a passing observation than a suggestion for this driver, but if there was ever a place where it looked like a couple of devm_ allocation functions would be useful, this is it;) edac_dev = devm_device_alloc_ctrl_info(dev, ...) ... devm_edac_device_add_device(dev, ...) I agree. I can implement the devm_* functions in separate patches as this is not related to my patches (and not to delay this patches).
Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
+static void al_a57_edac_l2merrsr(void *arg) +{ + edac_device_handle_ce(edac_dev, 0, 0, "L2 Error"); How do we know this is corrected? If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to uncorrected/corrected. Is this what you are depending on here? No - not on this. Reporting all the errors as corrected seems to be bad. Can i be depends on fatal field? That is described as "set to 1 on the first memory error that caused a Data Abort". I assume this is one of the parity-error external-aborts. If the repeat counter shows, say, 2, and fatal is set, you only know that at least one of these errors caused an abort. But it could have been all three. The repeat counter only matches against the RAMID and friends, otherwise the error is counted in 'other'. I don't think there is a right thing to do here, (other than increase the scrubbing frequency). As you can only feed one error into edac at a time then: if (fatal) edac_device_handle_ue(edac_dev, 0, 0, "L2 Error"); else edac_device_handle_ce(edac_dev, 0, 0, "L2 Error"); seems reasonable. You're reporting the most severe, and 'other/repeat' counter values just go missing. I had print the values of 'other/repeat' to be noticed. How can L2CTLR_EL1[20] force fatal? I don't think it can, on a second reading, it looks to be even more complicated than I thought! That bit is described as disabling forwarding of uncorrected data, but it looks like the uncorrected data never actually reaches the other end. (I'm unsure what 'flush' means in this context.) I was looking for reasons you could 'know' that any reported error was corrected. This was just a bad suggestion! Is there interrupt for un-correctable error? Does 'asynchronous errors' in L2 used to report UE? In case no interrupt, can we use die-notifier subsystem to check if any error had occur while system shutdown? + cluster = topology_physical_package_id(cpu); Hmm, I'm not sure cluster==package is guaranteed to be true forever. If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that. Otherwise pulling out the DT using something like the arch code's parse_cluster(). I rely on that it's alpine SoC specific driver. ... and that the topology code hasn't changed to really know what a package is: https://lore.kernel.org/lkml/20190529211340.17087-2-atish.pa...@wdc.com/T/#u As what you really want to know is 'same L2?', and you're holding the cpu_read_lock(), would struct cacheinfo's shared_cpu_map be a better fit? This would be done by something like a cpu-mask of cache:shared_cpu_map's for the L2's you've visited. It removes the dependency on package==L2, and insulates you from the cpu-numbering not being exactly as you expect. I'll add dt property that point to L2-cache node (phandle), then it'll be easy to create cpu-mask with all cores that point to same l2 cache. Thanks, Hanna
Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
Hi Boris, Yap, I think we're in agreement here. I believe the important question is whether you need to get error information from multiple sources together in order to do proper recovery or doing it per error source suffices. And I think the actual use cases could/should dictate our drivers/orchestrators design. Thus my question how you guys are planning on tying all that error info the drivers report, into the whole system design? We have daemon script that collects correctable/uncorrectable errors from EDAC sysfs and reports to Amazon service that allow us to take action on specific error thresholds. Thanks, Hanna
Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
Hi James, Allowing linux to access these implementation-defined registers has come up before: https://www.spinics.net/lists/kernel/msg2750349.html It looks like you've navigated most of the issues. Accessing implementation-defined registers is frowned on, but this stuff can't be done generically until v8.2. Sure, no planning to do this generally for all ARM a57/a72. I'm doing this specific for alpine SoCs. This can't be done on 'all A57/A72' because some platforms may not have been integrated to have error-checking at L1/L2, (1.3 'Features' of [0] says the ECC protection for L1 data cache etc is optional). Even if they did, this stuff needs turning on in L2CTLR_EL1. These implementation-defined registers may be trapped by the hypervisor, I assume for your platform this is linux booted at EL2, so no hypervisor. In Alpine-v2/Alpine-v3 Bit[21]-"ECC and parity enable" in L2CTRL_EL1 is enabled in the firmware. +#define ARM_CA57_CPUMERRSR_INDEX_OFF (0) +#define ARM_CA57_CPUMERRSR_INDEX_MASK (0x3) (GENMASK() would make it quicker and easier to compare this with the datasheet) Will be used in next patchset. +#define ARM_CA57_L2_INC_PLRU_RAM 0x18 A57 describes this one as 'PF RAM'... will be updated. Linux supports versions of binutils that choke on this syntax. See the sys_reg() definitions in arm64's asm/sysreg.h that define something you can feed to read_sysreg_s(). It would save having these wrapper functions. commit 72c583951526 ("arm64: gicv3: Allow GICv3 compilation with older binutils") for the story. Great, I'll use sys_reg(), read_sysreg_s(), and remove the wrapper functions. | #define ARM_CA57_CPUMERRSR_VALID BIT(31) | if (!(val & ARM_CA57_CPUMERRSR_VALID)) would be easier to read, the same goes for 'fatal' as its a single bit. Will be fixed, here and in al_a57_edac_l2merrsr. + edac_device_handle_ce(edac_dev, 0, 0, "L2 Error"); How do we know this was corrected? 6.4.8 "Error Correction Code" has "Double-bit ECC errors set the fatal bit." in a paragraph talking about the L1 memory system. I'll check fatal field to check if it's uncorrected/corrected. "L2 Error" ? Copy and paste? copy/paste mistake, will be fixed. + edac_printk(KERN_CRIT, DRV_NAME, "CPU%d L1 %serror detected\n", + cpu, (fatal) ? "Fatal " : ""); + edac_printk(KERN_CRIT, DRV_NAME, "RAMID="); + + switch (ramid) { + case ARM_CA57_L1_I_TAG_RAM: + pr_cont("'L1-I Tag RAM' index=%d way=%d", index, way); + break; + case ARM_CA57_L1_I_DATA_RAM: + pr_cont("'L1-I Data RAM' index=%d bank= %d", index, way); + break; Is index/way information really useful? I can't replace way-3 on the system, nor can I stop it being used. If its useless, I'd rather we don't bother parsing and printing it out. I'll remove the index/way information, and keep CPUMERRSR_EL1 value print (who want this information can parse the value and get the index/bank status) + pr_cont(", repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)\n", repeat, other, + val); 'other' here is another error, but we don't know the ramid. 'repeat' is another error for the same ramid. Could we still feed this stuff into edac? This would make the counters accurate if the polling frequency isn't quite fast enough. There is no API in EDAC to increase the counters, I can increase by accessing the ce_count/ue_count from edac_device_ctl_info/edac_device_instance structures if it's okay.. +static void al_a57_edac_l2merrsr(void *arg) +{ + edac_device_handle_ce(edac_dev, 0, 0, "L2 Error"); How do we know this is corrected? If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to uncorrected/corrected. Is this what you are depending on here? No - not on this. Reporting all the errors as corrected seems to be bad. Can i be depends on fatal field? if (fatal) edac_device_handle_ue(edac_dev, 0, 0, "L2 Error"); else edac_device_handle_ce(edac_dev, 0, 0, "L2 Error"); How can L2CTLR_EL1[20] force fatal? (it would be good to have a list of integration-time and firmware dependencies this driver has, for the next person who tries to enable it on their system and complains it doesn't work for them) Where can i add such information? + case ARM_CA57_L2_INC_PLRU_RAM: + pr_cont("'L2 Inclusion PLRU RAM'"); The A57 TRM describes this as "Inclusion PF RAM", and notes its only in r1p0 or later, (but doesn't say what it is). The A72 TRM describes the same encoding as "Inclusion PLRU RAM", which is something to do with its replacement policy. It has control bits that A57's version doesn't, so these are not the same thing. Disambiguating A57/A72 here is a load of faff, 'L2 internal metadata' probably covers both cases, but unless these RAMs are replaceable or can be disabled, there isn't much point working
Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
Hi Ben, Boris On 6/11/2019 8:50 AM, Benjamin Herrenschmidt wrote: Anyway, let's get back to the specific case of our Amazon platform here since it's a concrete example. Hanna, can you give us a reasonably exhaustive list of how many such "drivers" we'll want in the EDAC subsystem and whether you envision any coordination requirement between them or not ? In the near future we plan to push EDAC drivers for L1/L2 and memory controller. There's no common resources/shared data between them. Thanks, Hanna Cheers, Ben.
Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
On 5/31/2019 8:14 AM, Borislav Petkov wrote: On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote: This isn't terribly helpful, there's nothing telling anybody which of those files corresponds to an ARM SoC :-) drivers/edac/altera_edac.c is one example. Also, James and I have a small writeup on how an arm driver should look like, we just need to polish it up and post it. James? That said ... You really want a single EDAC driver that contains all the stuff for the caches, the memory controller, etc... ? Yap. Disagree. The various drivers don't depend on each other. I think we should keep the drivers separated as they are distinct and independent IP blocks.
Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
On 5/31/2019 4:15 AM, Herrenschmidt, Benjamin wrote: On Thu, 2019-05-30 at 11:19 -0700, Boris Petkov wrote: On May 30, 2019 3:15:29 AM PDT, Hanna Hawa wrote: Add support for error detection and correction for Amazon's Annapurna Labs SoCs for L1/L2 caches. So this should be a driver for the whole annapurna platform and not only about the RAS functionality in an IP like the caches. See other ARM EDAC drivers in drivers/edac/ for an example. This isn't terribly helpful, there's nothing telling anybody which of those files corresponds to an ARM SoC :-) That said ... You really want a single EDAC driver that contains all the stuff for the caches, the memory controller, etc... ? The idea here was to separate the core L1/L2 EDAC from the memory controller EDAC I think ... Roben, Hanna, can you describe the long run strategy here ? Correct our target to separate the L1/L2 EDAC from mc, and to maintain both in separate drivers. Thanks, Hanna Cheers, Ben.
Re: [PATCH 2/7] irqchip/alpine-msi: Update driver license to use SPDX
On 3/31/2019 3:46 PM, Mukesh Ojha wrote: On 3/31/2019 6:04 PM, Hanna Hawa wrote: Update driver license to be in-line with Linux conventions. Signed-off-by: Hanna Hawa --- drivers/irqchip/irq-alpine-msi.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/irqchip/irq-alpine-msi.c b/drivers/irqchip/irq-alpine-msi.c index ae2fca7..ec6a606 100644 --- a/drivers/irqchip/irq-alpine-msi.c +++ b/drivers/irqchip/irq-alpine-msi.c @@ -1,13 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0 Please fix this as per. https://lkml.org/lkml/2019/2/13/570 I'll fix in next patch-set, I'll wait for more inputs on other patches in the series. Thanks, Hanna cheers, Mukesh /* * Annapurna Labs MSIX support services * * Copyright (C) 2016, Amazon.com, Inc. or its affiliates. All Rights Reserved. * * Antoine Tenart - * - * This file is licensed under the terms of the GNU General Public - * License version 2. This program is licensed "as is" without any - * warranty of any kind, whether express or implied. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt