Re: [PATCH 2/4] edac: mpc85xx add mpc83xx support

2009-07-15 Thread Doug Thompson

Ira or Kumar,

Can you address Andrew's concerns below, and what was raised in prior posts
on this?

thanks

doug t

--- On Wed, 7/15/09, Andrew Morton a...@linux-foundation.org wrote:

 From: Andrew Morton a...@linux-foundation.org
 Subject: Re: [PATCH 2/4] edac: mpc85xx add mpc83xx support
 To: dougthomp...@xmission.com
 Cc: bluesmoke-de...@lists.sourceforge.net, linux-ker...@vger.kernel.org
 Date: Wednesday, July 15, 2009, 1:52 PM
 On Wed, 15 Jul 2009 11:38:49 -0600
 dougthomp...@xmission.com
 wrote:
 
  
  Add support for the Freescale MPC83xx memory controller to the existing
  driver for the Freescale MPC85xx memory controller. The only difference
  between the two processors is in the CS_BNDS register parsing code, which
  has been changed so it will work on both processors.
  
  The L2 cache controller does not exist on the MPC83xx, but the OF subsystem
  will not use the driver if the device is not present in the OF device tree.
  
  
  Kumar, I had to change the nr_pages calculation to make the math work
  out. I checked it on my board and did the math by hand for a 64GB 85xx
  using 64K pages. In both cases, nr_pages * PAGE_SIZE comes out to the
  correct value. Thanks for the help.
  
  v1 -> v2:
    * Use PAGE_SHIFT to parse cs_bnds regardless of board type
    * Remove special-casing for the 83xx processor
  
  ...
 
  @@ -789,19 +791,20 @@ static void __devinit mpc85xx_init_csrow
                   csrow = &mci->csrows[index];
                   cs_bnds = in_be32(pdata->mc_vbase + MPC85XX_MC_CS_BNDS_0 +
                                     (index * MPC85XX_MC_CS_BNDS_OFS));
  -                start = (cs_bnds & 0xfff0000) << 4;
  -                end = ((cs_bnds & 0xfff) << 20);
  -                if (start)
  -                        start |= 0xfffff;
  -                if (end)
  -                        end |= 0xfffff;
  +
  +                start = (cs_bnds & 0xffff0000) >> 16;
  +                end   = (cs_bnds & 0x0000ffff);
   
                   if (start == end)
                           continue;       /* not populated */
   
  +                start <<= (24 - PAGE_SHIFT);
  +                end   <<= (24 - PAGE_SHIFT);
  +                end    |= (1 << (24 - PAGE_SHIFT)) - 1;
 
 stares for a while
 
 That looks like the original code was really really wrong.
 
 The setting of all the lower bits in `end' is funny-looking.  What's
 happening here?  Should it be commented?
 
 
          
                   csrow->first_page = start >> PAGE_SHIFT;
                   csrow->last_page = end >> PAGE_SHIFT;
  -                csrow->nr_pages = csrow->last_page + 1 - csrow->first_page;
  +                csrow->nr_pages = end + 1 - start;
                   csrow->grain = 8;
                   csrow->mtype = mtype;
                   csrow->dtype = DEV_UNKNOWN;
  @@ -985,6 +988,7 @@ static struct of_device_id mpc85xx_mc_er
           { .compatible = "fsl,mpc8560-memory-controller", },
           { .compatible = "fsl,mpc8568-memory-controller", },
           { .compatible = "fsl,mpc8572-memory-controller", },
  +        { .compatible = "fsl,mpc8349-memory-controller", },
           { .compatible = "fsl,p2020-memory-controller", },
           {},
   };
  @@ -1001,13 +1005,13 @@ static struct of_platform_driver mpc85xx
                      },
   };
   
  -
  +#ifdef CONFIG_MPC85xx
   static void __init mpc85xx_mc_clear_rfxe(void *data)
   {
           orig_hid1[smp_processor_id()] = mfspr(SPRN_HID1);
           mtspr(SPRN_HID1, (orig_hid1[smp_processor_id()] & ~0x20000));
   }
  -
  +#endif
   
   static int __init mpc85xx_mc_init(void)
   {
  @@ -1040,26 +1044,32 @@ static int __init mpc85xx_mc_init(void)
           printk(KERN_WARNING EDAC_MOD_STR "PCI fails to register\n");
   #endif
   
  +#ifdef CONFIG_MPC85xx
           /*
            * need to clear HID1[RFXE] to disable machine check int
            * so we can catch it
            */
           if (edac_op_state == EDAC_OPSTATE_INT)
                   on_each_cpu(mpc85xx_mc_clear_rfxe, NULL, 0);
  +#endif
   
           return 0;
   }
 
 The patch adds lots of ifdefs :(
 
   module_init(mpc85xx_mc_init);
   
  +#ifdef CONFIG_MPC85xx
   static void __exit mpc85xx_mc_restore_hid1(void *data)
   {
           mtspr(SPRN_HID1, orig_hid1[smp_processor_id()]);
   }
  +#endif
 
 afaict this will run smp_processor_id() from within preemptible code,
 which is often buggy on preemptible kernels and will cause runtime
 warnings on at least some architectures.
 
   static void __exit mpc85xx_mc_exit(void)
   {
  +#ifdef CONFIG_MPC85xx
           on_each_cpu(mpc85xx_mc_restore_hid1, NULL, 0);
  +#endif
   #ifdef CONFIG_PCI
           of_unregister_platform_driver(&mpc85xx_pci_err_driver);
   #endif
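
On the question of why all the lower bits of `end' get set: the following is
a minimal userspace sketch of the arithmetic in the cs_bnds hunk, assuming (as
the shift by 24 suggests) that each 16-bit CS_BNDS field counts 16MB units.
The struct and function names are hypothetical and not part of the driver,
and the sketch stores start and end directly as page numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the result; not a kernel structure. */
struct page_range {
    uint64_t first_page;
    uint64_t last_page;   /* inclusive */
    uint64_t nr_pages;
};

/*
 * Convert a raw CS_BNDS value into a page range.
 * Returns 0 when start == end (treated as "not populated" in the patch),
 * 1 otherwise.
 */
int cs_bnds_to_pages(uint32_t cs_bnds, unsigned int page_shift,
                     struct page_range *out)
{
    uint64_t start = (cs_bnds & 0xffff0000u) >> 16;  /* first 16MB block */
    uint64_t end = cs_bnds & 0x0000ffffu;            /* last 16MB block */

    assert(page_shift <= 24);

    if (start == end)
        return 0;

    /* Convert 16MB block numbers to page numbers. */
    start <<= (24 - page_shift);
    end <<= (24 - page_shift);
    /* Point end at the last page inside the last block, not the first. */
    end |= (1ull << (24 - page_shift)) - 1;

    out->first_page = start;
    out->last_page = end;
    out->nr_pages = end + 1 - start;
    return 1;
}
```

With cs_bnds = 0x0000000f and 4K pages (page_shift = 12) this yields 65536
pages, i.e. 256MB; with cs_bnds = 0x00000fff and 64K pages (page_shift = 16)
nr_pages * PAGE_SIZE comes out to 64GB, the case checked by hand in the
changelog.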
 
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [v2 PATCH 2/3] EDAC: Add edac_device_alloc_index()

2009-04-17 Thread Doug Thompson

--- On Wed, 4/15/09, Andrew Morton a...@linux-foundation.org wrote:

 From: Andrew Morton a...@linux-foundation.org
 Subject: Re: [v2 PATCH 2/3] EDAC: Add edac_device_alloc_index()
 To: Harry Ciao qingtao@windriver.com
 Cc: nor...@yahoo.com, mich...@ellerman.id.au, 
 bluesmoke-de...@lists.sourceforge.net, linuxppc-dev@ozlabs.org, 
 linux-ker...@vger.kernel.org
 Date: Wednesday, April 15, 2009, 4:27 PM
 On Mon, 13 Apr 2009 14:05:15 +0800
 Harry Ciao qingtao@windriver.com
 wrote:
 
  Add edac_device_alloc_index(), because for MAPLE
 platform there may
  exist several EDAC driver modules that could make use
 of
  edac_device_ctl_info structure at the same time. The
 index allocation
  for these structures should be taken care of by EDAC
 core.
  
 
 From: Andrew Morton a...@linux-foundation.org
 
 keep things neat.  Also avoids having global
 identifier device_index
 shadowed by local identifier device_index.
 
 Cc: Benjamin Herrenschmidt b...@kernel.crashing.org

Acked-by: Doug Thompson dougthomp...@xmission.com

 Cc: Harry Ciao qingtao@windriver.com
 Cc: Kumar Gala ga...@gate.crashing.org
 Cc: Michael Ellerman mich...@ellerman.id.au
 Cc: Paul Mackerras pau...@samba.org
 Signed-off-by: Andrew Morton a...@linux-foundation.org
 ---
 
  drivers/edac/edac_device.c |    3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff -puN drivers/edac/amd8111_edac.c~edac-add-edac_device_alloc_index-cleanup drivers/edac/amd8111_edac.c
 diff -puN drivers/edac/edac_core.h~edac-add-edac_device_alloc_index-cleanup drivers/edac/edac_core.h
 diff -puN drivers/edac/edac_device.c~edac-add-edac_device_alloc_index-cleanup drivers/edac/edac_device.c
 --- a/drivers/edac/edac_device.c~edac-add-edac_device_alloc_index-cleanup
 +++ a/drivers/edac/edac_device.c
 @@ -37,7 +37,6 @@
   */
  static DEFINE_MUTEX(device_ctls_mutex);
  static LIST_HEAD(edac_device_list);
 -static atomic_t device_indexes = ATOMIC_INIT(0);
  
  #ifdef CONFIG_EDAC_DEBUG
  static void edac_device_dump_device(struct edac_device_ctl_info *edac_dev)
 @@ -499,6 +498,8 @@ void edac_device_reset_delay_period(stru
   */
  int edac_device_alloc_index(void)
  {
 +    static atomic_t device_indexes = ATOMIC_INIT(0);
 +
      return atomic_inc_return(&device_indexes) - 1;
  }
  EXPORT_SYMBOL_GPL(edac_device_alloc_index);
 _
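
For readers outside the kernel, the same pattern (a counter visible only
inside the allocator, handed out atomically) can be sketched in userspace
with C11 atomics. This is an illustration only: alloc_index() and next_index
are hypothetical names, and atomic_fetch_add() stands in for the kernel's
atomic_inc_return() - 1.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hand out 0, 1, 2, ... to concurrent callers without a lock. */
int alloc_index(void)
{
    /* Function-scope static: hidden from other code, shared by all callers. */
    static atomic_int next_index = 0;

    /* atomic_fetch_add() returns the value held before the increment. */
    int idx = atomic_fetch_add(&next_index, 1);

    assert(idx >= 0);   /* wrap-around is not handled in this sketch */
    return idx;
}
```

Keeping the counter inside the function means no other code in the file can
touch it, and a local variable elsewhere cannot shadow it.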
 
 


Re: [PATCH v2] edac: mpc85xx: Add support for MPC8572

2008-10-07 Thread Doug Thompson


Dave Jiang [EMAIL PROTECTED] wrote:

There's an SVN+quilt tree via sourceforge for EDAC. I have asked Doug to
push this patch upstream to the mm tree.

Kumar Gala wrote:
 
 On Sep 19, 2008, at 6:20 PM, Nate Case wrote:
 
 From: Andrew Kilkenny 

 This adds support for the dual-core MPC8572 processor.  We have
 to support making SPR changes on each core.  Also, since we can
 have multiple memory controllers sharing an interrupt, flag the
 interrupts with IRQF_SHARED.

 Signed-off-by: Andrew Kilkenny 
 Signed-off-by: Nate Case 
 ---
 drivers/edac/mpc85xx_edac.c |   28 +++-
 1 files changed, 23 insertions(+), 5 deletions(-)
 
 Acked-by: Kumar Gala 
 
 Guys, is there an edac git tree or something to create patches against?  
 I've got one I've been sitting on but it should be updated based on 
 Nate's patch.
 
 - k

the SVN repos is 

svn checkout https://bluesmoke.svn.sourceforge.net/svnroot/bluesmoke/trunk

the info page is

http://bluesmoke.sourceforge.net/

doug t



W1DUG

Re: [patch 1/9] powerpc/cell/edac: log a syndrome code in case of correctable error

2008-07-17 Thread Doug Thompson

--- Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:

 Arnd, Maxim, please, next time, send that patch or at least CC the
 bluesmoke-devel list for EDAC related bits.
 
 Doug, if you are ok with this patch, I'll merge it via the powerpc

fine with me, acked below

doug t

 tree.
 
 Cheers,
 Ben.
 
 On Tue, 2008-07-15 at 21:51 +0200, [EMAIL PROTECTED] 
 
  From: Maxim Shchetynin [EMAIL PROTECTED]
  
  If correctable error occurs the syndrome code was logged as 0. This patch
  lets EDAC to log a correct syndrome code to make problem investigation
  easier.
  
  Signed-off-by: Maxim Shchetynin [EMAIL PROTECTED]
  Signed-off-by: Arnd Bergmann [EMAIL PROTECTED]

Acked-by: Doug Thompson [EMAIL PROTECTED]


  ---
   drivers/edac/cell_edac.c |5 +++--
   1 files changed, 3 insertions(+), 2 deletions(-)
  
  diff --git a/drivers/edac/cell_edac.c b/drivers/edac/cell_edac.c
  index b54112f..0e024fe 100644
  --- a/drivers/edac/cell_edac.c
  +++ b/drivers/edac/cell_edac.c
  @@ -33,7 +33,7 @@ static void cell_edac_count_ce(struct mem_ctl_info *mci, int chan, u64 ar)
   {
           struct cell_edac_priv           *priv = mci->pvt_info;
           struct csrow_info               *csrow = &mci->csrows[0];
  -        unsigned long                   address, pfn, offset;
  +        unsigned long                   address, pfn, offset, syndrome;
   
           dev_dbg(mci->dev, "ECC CE err on node %d, channel %d, ar = 0x%016lx\n",
                   priv->node, chan, ar);
  @@ -44,10 +44,11 @@ static void cell_edac_count_ce(struct mem_ctl_info *mci, int chan, u64 ar)
           address = (address << 1) | chan;
           pfn = address >> PAGE_SHIFT;
           offset = address & ~PAGE_MASK;
  +        syndrome = (ar & 0x000000001fe00000ul) >> 21;
   
           /* TODO: Decoding of the error addresss */
           edac_mc_handle_ce(mci, csrow->first_page + pfn, offset,
  -                          0, 0, chan, "");
  +                          syndrome, 0, chan, "");
   }
   
   static void cell_edac_count_ue(struct mem_ctl_info *mci, int chan, u64 ar)
  -- 
  1.5.4.3
  
 
 


W1DUG
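
As an illustration of the syndrome extraction in the patch above, here is a
small standalone sketch. It assumes the mask selects an 8-bit field in bits
21..28 of the ar value, which is what a mask of 0x1fe00000 with a shift of 21
implies; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CE_SYNDROME_MASK   0x000000001fe00000ull
#define CE_SYNDROME_SHIFT  21

/* The mask and shift must describe the same 8-bit field. */
static_assert((CE_SYNDROME_MASK >> CE_SYNDROME_SHIFT) == 0xffu,
              "syndrome field is expected to be 8 bits wide");

/* Pull the syndrome code out of a raw 64-bit ar register value. */
unsigned int ce_syndrome(uint64_t ar)
{
    return (unsigned int)((ar & CE_SYNDROME_MASK) >> CE_SYNDROME_SHIFT);
}
```

Any bits outside 21..28 are ignored, so a register value with only address
bits set yields a syndrome of 0.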


Re: [PATCH] [POWERPC] pasemi: Broaden specific references to 1682M

2007-11-05 Thread Doug Thompson

I assume then that this patch will move upstream via the POWERPC path, is that 
right?


Signed-off-by: Doug Thompson [EMAIL PROTECTED]


--- Olof Johansson [EMAIL PROTECTED] wrote:

 [POWERPC] pasemi: Broaden specific references to 1682M
 
 There will be more product numbers in the future than just PA6T-1682M,
 but they will share much of the features. Remove some of the explicit
 references and compatibility checks with 1682M, and replace most of them
 with the more generic term PWRficient.
 
 
 Signed-off-by: Olof Johansson [EMAIL PROTECTED]
 
 ---
 
 This one touches drivers/char/hw_random and drivers/edac, but I'd prefer
 to just merge it up through the powerpc merge path since the changes
 are trivial.
 
 (Michael, Doug, if you disagree let me know and I can submit separate
 patches. This is 2.6.25 material anyway).
 
 
 -Olof
 
 
 diff --git a/arch/powerpc/platforms/pasemi/Kconfig b/arch/powerpc/platforms/pasemi/Kconfig
 index 735e153..2f4dd6e 100644
 --- a/arch/powerpc/platforms/pasemi/Kconfig
 +++ b/arch/powerpc/platforms/pasemi/Kconfig
 @@ -17,7 +17,7 @@ config PPC_PASEMI_IOMMU
          bool "PA Semi IOMMU support"
          depends on PPC_PASEMI
          help
 -          IOMMU support for PA6T-1682M
 +          IOMMU support for PA Semi PWRficient
  
  config PPC_PASEMI_IOMMU_DMA_FORCE
          bool "Force DMA engine to use IOMMU"
 diff --git a/arch/powerpc/platforms/pasemi/cpufreq.c b/arch/powerpc/platforms/pasemi/cpufreq.c
 index 1cfb8b0..8caa166 100644
 --- a/arch/powerpc/platforms/pasemi/cpufreq.c
 +++ b/arch/powerpc/platforms/pasemi/cpufreq.c
 @@ -147,7 +147,10 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy *policy)
          if (!cpu)
                  goto out;
  
 -        dn = of_find_compatible_node(NULL, "sdc", "1682m-sdc");
 +        dn = of_find_compatible_node(NULL, NULL, "1682m-sdc");
 +        if (!dn)
 +                dn = of_find_compatible_node(NULL, NULL,
 +                                             "pasemi,pwrficient-sdc");
          if (!dn)
                  goto out;
          err = of_address_to_resource(dn, 0, &res);
 @@ -160,7 +163,10 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy *policy)
                  goto out;
          }
  
 -        dn = of_find_compatible_node(NULL, "gizmo", "1682m-gizmo");
 +        dn = of_find_compatible_node(NULL, NULL, "1682m-gizmo");
 +        if (!dn)
 +                dn = of_find_compatible_node(NULL, NULL,
 +                                             "pasemi,pwrficient-gizmo");
          if (!dn) {
                  err = -ENODEV;
                  goto out_unmap_sdcasr;
 @@ -292,7 +298,8 @@ static struct cpufreq_driver pas_cpufreq_driver = {
  
  static int __init pas_cpufreq_init(void)
  {
 -        if (!machine_is_compatible("PA6T-1682M"))
 +        if (!machine_is_compatible("PA6T-1682M") &&
 +            !machine_is_compatible("pasemi,pwrficient"))
                  return -ENODEV;
  
          return cpufreq_register_driver(&pas_cpufreq_driver);
 diff --git a/arch/powerpc/platforms/pasemi/gpio_mdio.c b/arch/powerpc/platforms/pasemi/gpio_mdio.c
 index 95d0c78..b029804 100644
 --- a/arch/powerpc/platforms/pasemi/gpio_mdio.c
 +++ b/arch/powerpc/platforms/pasemi/gpio_mdio.c
 @@ -333,7 +333,10 @@ int gpio_mdio_init(void)
  {
          struct device_node *np;
  
 -        np = of_find_compatible_node(NULL, "gpio", "1682m-gpio");
 +        np = of_find_compatible_node(NULL, NULL, "1682m-gpio");
 +        if (!np)
 +                np = of_find_compatible_node(NULL, NULL,
 +                                             "pasemi,pwrficient-gpio");
          if (!np)
                  return -ENODEV;
          gpio_regs = of_iomap(np, 0);
 diff --git a/arch/powerpc/platforms/pasemi/setup.c b/arch/powerpc/platforms/pasemi/setup.c
 index 3a5d112..aeafe98 100644
 --- a/arch/powerpc/platforms/pasemi/setup.c
 +++ b/arch/powerpc/platforms/pasemi/setup.c
 @@ -362,8 +362,11 @@ static inline void pasemi_pcmcia_init(void)
  
  
  static struct of_device_id pasemi_bus_ids[] = {
 +        /* Unfortunately needed for legacy firmwares */
          { .type = "localbus", },
          { .type = "sdc", },
 +        { .compatible = "pasemi,localbus", },
 +        { .compatible = "pasemi,sdc", },
          {},
  };
  
 @@ -389,7 +392,8 @@ static int __init pas_probe(void)
  {
          unsigned long root = of_get_flat_dt_root();
  
 -        if (!of_flat_dt_is_compatible(root, "PA6T-1682M"))
 +        if (!of_flat_dt_is_compatible(root, "PA6T-1682M") &&
 +            !of_flat_dt_is_compatible(root, "pasemi,pwrficient"))
                  return 0;
  
          hpte_init_native();
 @@ -400,7 +404,7 @@ static int __init pas_probe(void)
  }
  
  define_machine(pasemi) {
 -        .name           = "PA Semi PA6T-1682M",
 +        .name           = "PA Semi PWRficient",
          .probe          = pas_probe,
          .setup_arch     = pas_setup_arch,
          .init_early     = pas_init_early,
 diff --git a/drivers/char/hw_random/Kconfig b/drivers/char/hw_random/Kconfig
 index 2d7cd48..6bbd4fa 100644
 --- a/drivers/char/hw_random/Kconfig
 +++ b/drivers/char/hw_random/Kconfig
 @@ -98,7 +98,7

Re: EDAC stats PCI error recovery (was Re: [PATCH 2/2] powerpc: MPC85xx EDAC device driver)

2007-08-01 Thread Doug Thompson

--- Linas Vepstas [EMAIL PROTECTED] wrote:

 On Mon, Jul 30, 2007 at 03:47:05PM -0700, Doug Thompson wrote:
  
  --- Linas Vepstas [EMAIL PROTECTED] wrote:
   Also: please note that the linux kernel has a pci error recovery
   mechanism built in; its used by pseries and PCI-E. I'm not clear
   on what any of this has to do with EDAC, which I thought was supposed 
   to be for RAM only. (The EDAC project once talked about doing pci error 
   recovery, but that was years ago, and there is a separate system for
   that, now.)
  
  no, edac can/does harvest PCI bus errors, via polling and other hardware 
  error detectors.
 
 Ehh! I had no idea. A few years ago, when I was working on the PCI error
 recovery, I sent a number of emails to the various EDAC people and mailing 
 lists that I could find, and never got a response.  I assumed the
 project was dead. I guess its not ... 

No, it's not; some company layoffs just stirred the pot, at least for me, for
a while. I did see the IBM patches go by, but didn't have the time to check up
at that time. I actually didn't know the recovery interface had gotten into
the kernel (my failure to watch for them), so I was pleasantly surprised at
this last OLS to attend the presentation.

 
  But at the current time, few PCI device drivers initialize those callback
  functions and thus errors are lost and some IO transactions fail.
 
 There are patches for 6 drivers in mainline (e100, e1000, ixgb, s2io,
 ipr, lpfc), and two more pending (sym53cxxx, tg3).  So far, I've written 
 all of them. 

Great.
EDAC does nothing for recovery, just logging and stats gathering and 
presentation.

 
  Over time, as drivers get updated (might take some time) then drivers
  can take some sort of action FOR THEMSELVES
 
 I think I need to do more to raise awareness and interest.

good point

 
  Yet, there is no tracking of errors - except for a log message in the log
  file.
  
  There is NO meter on frequency of errors, etc. One must grep the log file
  and that is not a very cycle friendly mechanism.
 
 Yeah, there was low interest in stats. There's a core set of stats in
 /proc/ppc64/eeh, but these are clearly arch-specific. I'd like to move
 away from those.  Some recent patches added stats to the /sys tree,
 under the individual pci bridge and device nodes.  Again, these are
 arch-specific; I'd like to move to some general/standardized presentation.

The memory error consumers really like the stats of EDAC; they allow them to
track trends. Cluster types, with thousands of nodes, like the monitoring for
both memory and PCI, as well as some newer hardware detector harvesting.

 
  The reason I added PCI parity/error device scanning, was that when I was at
  Linux Networx, we had parity errors on the PCI-X bus, but didn't know the
  cause.  After we discovered that a simple PCI-X riser card had manufacturing
  problems (quality) and didn't drive lines properly, it caused parity errors.
 
 Heh. Not unusual. I've seen/heard of cases with voltages being low,
 and/or ground-bounce in slots near the end. There's a whole zoo of
 hardware/firmware bugs that we've had to painfully crawl through and
 fix. That's why the IBM boxes cost big $$$; here's to hoping that 
 customers understand why.

I understand


 
  This feature allowed us to track nodes that were having parity problems,
  but we had no METER to know it.
  
  Recovery is a good thing, BUT how do you know you having LOTS of
  errors/recovery events? You need a meter. EDAC provides that METER
 
 I'm lazy. What source code should I be looking at?  I'm concerned about
 duplication of function and proliferation of interfaces. I've got my 
 metering data under (for example)
 /sys/bus/pci/devices/0001:c0:01.0/eeh_*, mostly very arch specific.
 The code for this is in arch/powerpc/platforms/pseries/eeh_sysfs.c

http://bluesmoke.sourceforge.net/

is the SF project zone (bluesmoke was the out-of-tree name, changed to EDAC
when it came into tree, and source forge doesn't allow renaming)

EDAC info is under:

/sys/devices/system/edac/

mc for memory controllers
pci for pci info.

very basic, just counters and some controls


 
  I met with Yanmin Zhang of Intel at OLS after his paper presentation on PCI
  Express Advanced Error Reporting in the Kernel, and we talked about this
  same thing. I am talking with him on having the recovery code present
  information into EDAC sysfs area. (hopefully, anyway)
 
 Hmm. OK, where's that?  Back when, I'd talked to Yanmin about coming up
 with a generic, arch-indep way of driving the recovery routines. But
 this wasn't exactly easy, and we were still grappling with just getting
 things working.  Now that things are working, it's time to broaden
 horizons.

Not very far, but I see the potential.
When EDAC was received, it was placed where it was in the sysfs from various
kernel developers as a good spot on its own.

 
 Can you point me to the current edac code?
 find

Re: [PATCH 2/2] powerpc: MPC85xx EDAC device driver

2007-07-30 Thread Doug Thompson

--- Linas Vepstas [EMAIL PROTECTED] wrote:

 On Mon, Jul 30, 2007 at 01:17:40PM -0700, Dave Jiang wrote:
  Arnd Bergmann wrote:
   The best solution may be to look at how it's structured at the
   register level. If the PCI EDAC registers are implemented separately
   from the regular PCI registers, a device tree entry would be appropriate.
   If not, your idea of registering a platform_device from fsl_add_bridge
   is probably more sensible.
   
  
  We can probably do either. From looking at the 8560 and 8548 manuals, the
  PCI error registers are 0xe00 offset of the start of PCI registers. For
  example, the PCI registers would start at 0x8000 offset. And the PCI error
  registers would be at 0xe00 offset from there and would be the very last
  block of registers.
 
 Anywhere I can easily get an overview of these PCI error registers?
 
 Also: please note that the linux kernel has a pci error recovery
 mechanism built in; its used by pseries and PCI-E. I'm not clear
 on what any of this has to do with EDAC, which I thought was supposed 
 to be for RAM only. (The EDAC project once talked about doing pci error 
 recovery, but that was years ago, and there is a separate system for
 that, now.)

no, edac can/does harvest PCI bus errors, via polling and other hardware error
detectors.

The pci error recovery added a couple of NEW device callback functions in the
driver interface, which the bus layer can call to notify drivers that a PCI
bus error occurred. Then the driver can do some action on the event.

But at the current time, few PCI device drivers initialize those callback
functions and thus errors are lost and some IO transactions fail.

Over time, as drivers get updated (might take some time) then drivers
can take some sort of action FOR THEMSELVES

Yet, there is no tracking of errors - except for a log message in the log file.

There is NO meter on frequency of errors, etc. One must grep the log file and
that is not a very cycle friendly mechanism.

The reason I added PCI parity/error device scanning, was that when I was at
Linux Networx, we had parity errors on the PCI-X bus, but didn't know the
cause.  After we discovered that a simple PCI-X riser card had manufacturing
problems (quality) and didn't drive lines properly, it caused parity errors.
This feature allowed us to track nodes that were having parity problems, but
we had no METER to know it.

Recovery is a good thing, BUT how do you know you having LOTS of
errors/recovery events? You need a meter. EDAC provides that METER

I met with Yanmin Zhang of Intel at OLS after his paper presentation on PCI
Express Advanced Error Reporting in the Kernel, and we talked about this same
thing. I am talking with him on having the recovery code present information
into EDAC sysfs area. (hopefully, anyway)

The recovery generates log messages BUT having to periodically 'grep' the log
file looking for errors is not a good use of CPU cycles. grep once for a count
and then grep later for a count and then compare the counts for a delta count
per unit time. ugly.

The EDAC solution is to be able to have a Listener thread in user space that
can be notified (via poll()) that an event has occurred.

There is more than one consumer (error recover) of error events:

1) driver recovery after a transaction (which is the recovery consumer above)
2) Management agents for health of a node
3) Maintenance agents for predictive component replacement

Rates of change of errors can be gathered as well.

EDAC allows for presentation of error counts via sysfs entries, from which
user space programs can harvest for over-time profiling
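
A rough sketch of that harvesting from user space: read a counter file twice
and turn the difference into a rate, instead of grepping logs. The helper
names are hypothetical, and the example path mentioned below
(/sys/devices/system/edac/mc/mc0/ce_count) is an assumption about the sysfs
layout rather than something stated in this thread.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single unsigned decimal counter from a sysfs-style text file.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
long long read_counter(const char *path)
{
    FILE *f = fopen(path, "r");
    unsigned long long value;

    if (!f)
        return -1;
    if (fscanf(f, "%llu", &value) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return (long long)value;
}

/*
 * Errors per second between two samples taken `seconds` apart.
 * Returns -1.0 if either sample is invalid or the counter went backwards
 * (for example after a counter reset).
 */
double errors_per_second(long long before, long long after, double seconds)
{
    assert(seconds > 0.0);

    if (before < 0 || after < before)
        return -1.0;
    return (double)(after - before) / seconds;
}
```

A monitoring agent could call read_counter() on something like
/sys/devices/system/edac/mc/mc0/ce_count at a fixed interval and feed
successive samples to errors_per_second().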

We have MEMORY (edac_mc) devices for chipsets now, but via the new edac_device
class, such things as ECC error tracking on DMA error checkers, FABRIC
switches, L1 and L2 cache ECC events, core CPU data ECC checkers, etc can be
done. I have an out of kernel tree MIPS driver do just this. Other types of
harvesters can be generated as well for other and/or new hardware error
detectors.

doug thompson

 
 --linas
 
 
 
