Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-19 Thread Mauro Carvalho Chehab
Em Mon, 18 Feb 2013 12:24:29 -0300
Mauro Carvalho Chehab  escreveu:

> Em Mon, 18 Feb 2013 14:52:51 +0100
> Borislav Petkov  escreveu:
> 
> > On Sun, Feb 17, 2013 at 07:44:04AM -0300, Mauro Carvalho Chehab wrote:
> > > We could do it for the location. The space for label, however, depends on
> > > how many DIMMs are in the system, as multiple dimm's may be present, and
> > > the core will point to all possible affected DIMMs.
> > > 
> > > Ok, perhaps we could just allocate one big area for it (like one page), 
> > > as this would very likely be enough for it, and change the logic to take
> > > the buffer size into account when filling it.
> > 
> > Or, in the case where ->label is all dimms on the mci, you simply put
> > "All DIMMs on MCI%d" in there and done. Simple.
> 
> The core does this already when it has no glue at all about where is the
> error.
> 
> The core is prepared to the case where the location is only half-filled,
> as this is a common scenario on the drivers, and important enough on
> some memory controllers.
> 
> As already discussed, on most memory controllers nowadays, the memory
> controller can't point to a single DIMM, as the error correction code
> takes 128 bits (2 DIMMs). It is impossible for the error correction
> code to determine on what DIMM an uncorrected error happened[1].
> 
> With Nehalem memory controllers, depending on the memory configuration,
> the minimal DIMM granularity for an uncorrected error can be even worse: 
> 4 DIMMs, if 128-bits error correction code and mirror mode are both enabled.
> 
> There are some border cases where the driver can simply not discover on
> what channel or on what dimm(or csrow) inside a channel the error
> happened. The error could be associated with some failure at the logic
> or at the bus that communicated with the Advanced Memory Buffers on an
> FB-DIMM memory controller, for example.
> 
> So, the real core's worse case scenario would be if the driver can't
> determine on what DIMM inside a channel the error happened. As a channel
> can have a large number of DIMMs[2] the allocated area for the label
> should be conservative.
> 
> 
>  (16? Not sure what's the worse case),
> 
> [1] such error can even not be fatal, if that particular address is
> unused.
> 
> [2] Currently, up to 8, according with:
>   $for i in $(git grep "layers.*size\s*=" drivers/edac|perl -ne 'print 
> "$1 " if (m/\=\s*([A-Z][^\s]+);/);'); do echo $i; git grep $i drivers/edac; 
> done|grep define|perl -ne 'print "$1 " if (m/define\s+[^\s]+\s(\d+)/)'
>   8 8 2 2 4 2 3 3 3 8 4 4 3 3 1 1 4 
> 
> and
>   $ git grep "layers.*size\s*=" drivers/edac|perl -ne 'print "$1 " if 
> (m/\=\s*(\d+);/);'
>   1 1 1 1 2 2 8 4 1 1 1 1 
> 
> Nothing prevents that a driver would have more than 8 DIMMs per layer
> in the future.

I suspect that you'll be happy with the enclosed patch ;)

It embeds the two string buffers at the mci structure. There are space there
for up to EDAC_MAX_LABELS at the "mci->label" string. If an error affects
more than EDAC_MAX_LABELS, the report logic will write "any memory", just like
what happens when the driver can't discover where the error is.

Tested with sb_edac driver.

Regards,
Mauro

[PATCH EDAC] edac: put all arguments for the raw error handling call into a 
struct

The number of arguments for edac_raw_mc_handle_error() is too big;
put them into a structure and allocate space for it inside
edac_mc_alloc().

That reduces a lot the stack usage and simplifies the raw API call.

Tested with sb_edac driver and MCE error injection. Worked as expected:

[  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)

Signed-off-by: Mauro Carvalho Chehab 

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..3c2625e 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -455,20 +455,8 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info 
*mci,
  unsigned long page);
 
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
- struct mem_ctl_info *mci,
- long grain,
- const u16 error_count,
- 

Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-19 Thread Mauro Carvalho Chehab
Em Mon, 18 Feb 2013 12:24:29 -0300
Mauro Carvalho Chehab mche...@redhat.com escreveu:

 Em Mon, 18 Feb 2013 14:52:51 +0100
 Borislav Petkov b...@alien8.de escreveu:
 
  On Sun, Feb 17, 2013 at 07:44:04AM -0300, Mauro Carvalho Chehab wrote:
   We could do it for the location. The space for label, however, depends on
   how many DIMMs are in the system, as multiple dimm's may be present, and
   the core will point to all possible affected DIMMs.
   
   Ok, perhaps we could just allocate one big area for it (like one page), 
   as this would very likely be enough for it, and change the logic to take
   the buffer size into account when filling it.
  
  Or, in the case where -label is all dimms on the mci, you simply put
  All DIMMs on MCI%d in there and done. Simple.
 
 The core does this already when it has no glue at all about where is the
 error.
 
 The core is prepared to the case where the location is only half-filled,
 as this is a common scenario on the drivers, and important enough on
 some memory controllers.
 
 As already discussed, on most memory controllers nowadays, the memory
 controller can't point to a single DIMM, as the error correction code
 takes 128 bits (2 DIMMs). It is impossible for the error correction
 code to determine on what DIMM an uncorrected error happened[1].
 
 With Nehalem memory controllers, depending on the memory configuration,
 the minimal DIMM granularity for an uncorrected error can be even worse: 
 4 DIMMs, if 128-bits error correction code and mirror mode are both enabled.
 
 There are some border cases where the driver can simply not discover on
 what channel or on what dimm(or csrow) inside a channel the error
 happened. The error could be associated with some failure at the logic
 or at the bus that communicated with the Advanced Memory Buffers on an
 FB-DIMM memory controller, for example.
 
 So, the real core's worse case scenario would be if the driver can't
 determine on what DIMM inside a channel the error happened. As a channel
 can have a large number of DIMMs[2] the allocated area for the label
 should be conservative.
 
 
  (16? Not sure what's the worse case),
 
 [1] such error can even not be fatal, if that particular address is
 unused.
 
 [2] Currently, up to 8, according with:
   $for i in $(git grep layers.*size\s*= drivers/edac|perl -ne 'print 
 $1  if (m/\=\s*([A-Z][^\s]+);/);'); do echo $i; git grep $i drivers/edac; 
 done|grep define|perl -ne 'print $1  if (m/define\s+[^\s]+\s(\d+)/)'
   8 8 2 2 4 2 3 3 3 8 4 4 3 3 1 1 4 
 
 and
   $ git grep layers.*size\s*= drivers/edac|perl -ne 'print $1  if 
 (m/\=\s*(\d+);/);'
   1 1 1 1 2 2 8 4 1 1 1 1 
 
 Nothing prevents that a driver would have more than 8 DIMMs per layer
 in the future.

I suspect that you'll be happy with the enclosed patch ;)

It embeds the two string buffers at the mci structure. There are space there
for up to EDAC_MAX_LABELS at the mci-label string. If an error affects
more than EDAC_MAX_LABELS, the report logic will write any memory, just like
what happens when the driver can't discover where the error is.

Tested with sb_edac driver.

Regards,
Mauro

[PATCH EDAC] edac: put all arguments for the raw error handling call into a 
struct

The number of arguments for edac_raw_mc_handle_error() is too big;
put them into a structure and allocate space for it inside
edac_mc_alloc().

That reduces a lot the stack usage and simplifies the raw API call.

Tested with sb_edac driver and MCE error injection. Worked as expected:

[  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)

Signed-off-by: Mauro Carvalho Chehab mche...@redhat.com

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..3c2625e 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -455,20 +455,8 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info 
*mci,
  unsigned long page);
 
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
- struct mem_ctl_info *mci,
- long grain,
- const u16 error_count,
- const int top_layer,
- 

Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-18 Thread Mauro Carvalho Chehab
Em Mon, 18 Feb 2013 14:52:51 +0100
Borislav Petkov  escreveu:

> On Sun, Feb 17, 2013 at 07:44:04AM -0300, Mauro Carvalho Chehab wrote:
> > We could do it for the location. The space for label, however, depends on
> > how many DIMMs are in the system, as multiple dimm's may be present, and
> > the core will point to all possible affected DIMMs.
> > 
> > Ok, perhaps we could just allocate one big area for it (like one page), 
> > as this would very likely be enough for it, and change the logic to take
> > the buffer size into account when filling it.
> 
> Or, in the case where ->label is all dimms on the mci, you simply put
> "All DIMMs on MCI%d" in there and done. Simple.

The core does this already when it has no glue at all about where is the
error.

The core is prepared to the case where the location is only half-filled,
as this is a common scenario on the drivers, and important enough on
some memory controllers.

As already discussed, on most memory controllers nowadays, the memory
controller can't point to a single DIMM, as the error correction code
takes 128 bits (2 DIMMs). It is impossible for the error correction
code to determine on what DIMM an uncorrected error happened[1].

With Nehalem memory controllers, depending on the memory configuration,
the minimal DIMM granularity for an uncorrected error can be even worse: 
4 DIMMs, if 128-bits error correction code and mirror mode are both enabled.

There are some border cases where the driver can simply not discover on
what channel or on what dimm(or csrow) inside a channel the error
happened. The error could be associated with some failure at the logic
or at the bus that communicated with the Advanced Memory Buffers on an
FB-DIMM memory controller, for example.

So, the real core's worse case scenario would be if the driver can't
determine on what DIMM inside a channel the error happened. As a channel
can have a large number of DIMMs[2] the allocated area for the label
should be conservative.


 (16? Not sure what's the worse case),

[1] such error can even not be fatal, if that particular address is
unused.

[2] Currently, up to 8, according with:
$for i in $(git grep "layers.*size\s*=" drivers/edac|perl -ne 'print 
"$1 " if (m/\=\s*([A-Z][^\s]+);/);'); do echo $i; git grep $i drivers/edac; 
done|grep define|perl -ne 'print "$1 " if (m/define\s+[^\s]+\s(\d+)/)'
8 8 2 2 4 2 3 3 3 8 4 4 3 3 1 1 4 

and
$ git grep "layers.*size\s*=" drivers/edac|perl -ne 'print "$1 " if 
(m/\=\s*(\d+);/);'
1 1 1 1 2 2 8 4 1 1 1 1 

Nothing prevents that a driver would have more than 8 DIMMs per layer
in the future.

-- 

Cheers,
Mauro
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-18 Thread Borislav Petkov
On Sun, Feb 17, 2013 at 07:44:04AM -0300, Mauro Carvalho Chehab wrote:
> We could do it for the location. The space for label, however, depends on
> how many DIMMs are in the system, as multiple dimm's may be present, and
> the core will point to all possible affected DIMMs.
> 
> Ok, perhaps we could just allocate one big area for it (like one page), 
> as this would very likely be enough for it, and change the logic to take
> the buffer size into account when filling it.

Or, in the case where ->label is all dimms on the mci, you simply put
"All DIMMs on MCI%d" in there and done. Simple.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-18 Thread Borislav Petkov
On Sun, Feb 17, 2013 at 07:44:04AM -0300, Mauro Carvalho Chehab wrote:
 We could do it for the location. The space for label, however, depends on
 how many DIMMs are in the system, as multiple dimm's may be present, and
 the core will point to all possible affected DIMMs.
 
 Ok, perhaps we could just allocate one big area for it (like one page), 
 as this would very likely be enough for it, and change the logic to take
 the buffer size into account when filling it.

Or, in the case where -label is all dimms on the mci, you simply put
All DIMMs on MCI%d in there and done. Simple.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-18 Thread Mauro Carvalho Chehab
Em Mon, 18 Feb 2013 14:52:51 +0100
Borislav Petkov b...@alien8.de escreveu:

 On Sun, Feb 17, 2013 at 07:44:04AM -0300, Mauro Carvalho Chehab wrote:
  We could do it for the location. The space for label, however, depends on
  how many DIMMs are in the system, as multiple dimm's may be present, and
  the core will point to all possible affected DIMMs.
  
  Ok, perhaps we could just allocate one big area for it (like one page), 
  as this would very likely be enough for it, and change the logic to take
  the buffer size into account when filling it.
 
 Or, in the case where -label is all dimms on the mci, you simply put
 All DIMMs on MCI%d in there and done. Simple.

The core does this already when it has no glue at all about where is the
error.

The core is prepared to the case where the location is only half-filled,
as this is a common scenario on the drivers, and important enough on
some memory controllers.

As already discussed, on most memory controllers nowadays, the memory
controller can't point to a single DIMM, as the error correction code
takes 128 bits (2 DIMMs). It is impossible for the error correction
code to determine on what DIMM an uncorrected error happened[1].

With Nehalem memory controllers, depending on the memory configuration,
the minimal DIMM granularity for an uncorrected error can be even worse: 
4 DIMMs, if 128-bits error correction code and mirror mode are both enabled.

There are some border cases where the driver can simply not discover on
what channel or on what dimm(or csrow) inside a channel the error
happened. The error could be associated with some failure at the logic
or at the bus that communicated with the Advanced Memory Buffers on an
FB-DIMM memory controller, for example.

So, the real core's worse case scenario would be if the driver can't
determine on what DIMM inside a channel the error happened. As a channel
can have a large number of DIMMs[2] the allocated area for the label
should be conservative.


 (16? Not sure what's the worse case),

[1] such error can even not be fatal, if that particular address is
unused.

[2] Currently, up to 8, according with:
$for i in $(git grep layers.*size\s*= drivers/edac|perl -ne 'print 
$1  if (m/\=\s*([A-Z][^\s]+);/);'); do echo $i; git grep $i drivers/edac; 
done|grep define|perl -ne 'print $1  if (m/define\s+[^\s]+\s(\d+)/)'
8 8 2 2 4 2 3 3 3 8 4 4 3 3 1 1 4 

and
$ git grep layers.*size\s*= drivers/edac|perl -ne 'print $1  if 
(m/\=\s*(\d+);/);'
1 1 1 1 2 2 8 4 1 1 1 1 

Nothing prevents that a driver would have more than 8 DIMMs per layer
in the future.

-- 

Cheers,
Mauro
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-17 Thread Mauro Carvalho Chehab
Em Sat, 16 Feb 2013 17:57:48 +0100
Borislav Petkov  escreveu:

> On Fri, Feb 15, 2013 at 04:20:29PM -0200, Mauro Carvalho Chehab wrote:
> > Yeah, pre-allocating a buffer is something that it was on my plans. It
> > seems it is time to do it in a clean way. I prefer to keep this as a
> > separate patch from 07/13, as it has a different rationale, and mixing
> > with 07/13 would just mix two different subjects.
> >
> > Also, having it separate helps reviewing.
> 
> Yep.
> 
> > ---
> > 
> > [PATCH] edac: put all arguments for the raw error handling call into a 
> > struct
> > 
> > The number of arguments for edac_raw_mc_handle_error() is too big;
> > put them into a structure and allocate space for it inside
> > edac_mc_alloc().
> > 
> > That reduces a lot the stack usage and simplifies the raw API call.
> > 
> > Tested with sb_edac driver and MCE error injection. Worked as expected:
> > 
> > [  143.066100] EDAC MC0: 1 CE memory read error on 
> > CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 
> > grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 
> > channel_mask:1 rank:0)
> > [  143.086424] EDAC MC0: 1 CE memory read error on 
> > CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 
> > grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 
> > channel_mask:1 rank:0)
> > [  143.106570] EDAC MC0: 1 CE memory read error on 
> > CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 
> > grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 
> > channel_mask:1 rank:0)
> > [  143.126712] EDAC MC0: 1 CE memory read error on 
> > CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 
> > grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 
> > channel_mask:1 rank:0)
> > 
> > Signed-off-by: Mauro Carvalho Chehab 
> > 
> > diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> > index 9c5da11..9cf33a5 100644
> > --- a/drivers/edac/edac_core.h
> > +++ b/drivers/edac/edac_core.h
> > @@ -454,21 +454,20 @@ extern struct mem_ctl_info *edac_mc_del_mc(struct 
> > device *dev);
> >  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
> >   unsigned long page);
> >  
> > +static inline void edac_raw_error_desc_clean(struct edac_raw_error_desc *e)
> > +{
> > +   int offset = offsetof(struct edac_raw_error_desc, grain);
> > +
> > +   *e->location = '\0';
> > +   *e->label = '\0';
> 
> Why the special handling? Why not memset the whole thing?

We don't want to clean the pointers for the allocated area, just to
clean the strings.

> 
> > +
> > +   memset(e + offset, 0, sizeof(*e) - offset);
> > +}
> > +
> > +
> >  void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
> > - struct mem_ctl_info *mci,
> > - long grain,
> > - const u16 error_count,
> > - const int top_layer,
> > - const int mid_layer,
> > - const int low_layer,
> > - const unsigned long page_frame_number,
> > - const unsigned long offset_in_page,
> > - const unsigned long syndrome,
> > - const char *msg,
> > - const char *location,
> > - const char *label,
> > - const char *other_detail,
> > - const bool enable_per_layer_report);
> > + struct mem_ctl_info *mci,
> > + struct edac_raw_error_desc *e);
> >  
> >  void edac_mc_handle_error(const enum hw_event_mc_err_type type,
> >   struct mem_ctl_info *mci,
> > diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
> > index 8fddf65..d72853b 100644
> > --- a/drivers/edac/edac_mc.c
> > +++ b/drivers/edac/edac_mc.c
> > @@ -42,6 +42,12 @@
> >  static DEFINE_MUTEX(mem_ctls_mutex);
> >  static LIST_HEAD(mc_devices);
> >  
> > +/* Maximum size of the location string */
> > +#define LOCATION_SIZE 80
> > +
> > +/* String used to join two or more labels */
> > +#define OTHER_LABEL " or "
> > +
> >  /*
> >   * Used to lock EDAC MC to just one module, avoiding two drivers e. g.
> >   * apei/ghes and i7core_edac to be used at the same time.
> > @@ -232,6 +238,11 @@ static void _edac_mc_free(struct mem_ctl_info *mci)
> > }
> > kfree(mci->csrows);
> > }
> > +
> > +   /* Frees the error report string area */
> > +   kfree(mci->error_event.location);
> > +   kfree(mci->error_event.label);
> > +
> > kfree(mci);
> >  }
> >  
> > @@ -445,6 +456,12 @@ struct mem_ctl_info *edac_mc_alloc(unsigned mc_num,
> > }
> > }
> >  
> > +   /* Allocate memory for the error report */
> > +   mci->error_event.location = kmalloc(LOCATION_SIZE, GFP_KERNEL);
> > +   mci->error_event.label = kmalloc((EDAC_MC_LABEL_LEN + 1 +
> > +sizeof(OTHER_LABEL)) 

Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-17 Thread Mauro Carvalho Chehab
Em Sat, 16 Feb 2013 17:57:48 +0100
Borislav Petkov b...@alien8.de escreveu:

 On Fri, Feb 15, 2013 at 04:20:29PM -0200, Mauro Carvalho Chehab wrote:
  Yeah, pre-allocating a buffer is something that it was on my plans. It
  seems it is time to do it in a clean way. I prefer to keep this as a
  separate patch from 07/13, as it has a different rationale, and mixing
  with 07/13 would just mix two different subjects.
 
  Also, having it separate helps reviewing.
 
 Yep.
 
  ---
  
  [PATCH] edac: put all arguments for the raw error handling call into a 
  struct
  
  The number of arguments for edac_raw_mc_handle_error() is too big;
  put them into a structure and allocate space for it inside
  edac_mc_alloc().
  
  That reduces a lot the stack usage and simplifies the raw API call.
  
  Tested with sb_edac driver and MCE error injection. Worked as expected:
  
  [  143.066100] EDAC MC0: 1 CE memory read error on 
  CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 
  grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 
  channel_mask:1 rank:0)
  [  143.086424] EDAC MC0: 1 CE memory read error on 
  CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 
  grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 
  channel_mask:1 rank:0)
  [  143.106570] EDAC MC0: 1 CE memory read error on 
  CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 
  grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 
  channel_mask:1 rank:0)
  [  143.126712] EDAC MC0: 1 CE memory read error on 
  CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 
  grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 
  channel_mask:1 rank:0)
  
  Signed-off-by: Mauro Carvalho Chehab mche...@redhat.com
  
  diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
  index 9c5da11..9cf33a5 100644
  --- a/drivers/edac/edac_core.h
  +++ b/drivers/edac/edac_core.h
  @@ -454,21 +454,20 @@ extern struct mem_ctl_info *edac_mc_del_mc(struct 
  device *dev);
   extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
unsigned long page);
   
  +static inline void edac_raw_error_desc_clean(struct edac_raw_error_desc *e)
  +{
  +   int offset = offsetof(struct edac_raw_error_desc, grain);
  +
  +   *e-location = '\0';
  +   *e-label = '\0';
 
 Why the special handling? Why not memset the whole thing?

We don't want to clean the pointers for the allocated area, just to
clean the strings.

 
  +
  +   memset(e + offset, 0, sizeof(*e) - offset);
  +}
  +
  +
   void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
  - struct mem_ctl_info *mci,
  - long grain,
  - const u16 error_count,
  - const int top_layer,
  - const int mid_layer,
  - const int low_layer,
  - const unsigned long page_frame_number,
  - const unsigned long offset_in_page,
  - const unsigned long syndrome,
  - const char *msg,
  - const char *location,
  - const char *label,
  - const char *other_detail,
  - const bool enable_per_layer_report);
  + struct mem_ctl_info *mci,
  + struct edac_raw_error_desc *e);
   
   void edac_mc_handle_error(const enum hw_event_mc_err_type type,
struct mem_ctl_info *mci,
  diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
  index 8fddf65..d72853b 100644
  --- a/drivers/edac/edac_mc.c
  +++ b/drivers/edac/edac_mc.c
  @@ -42,6 +42,12 @@
   static DEFINE_MUTEX(mem_ctls_mutex);
   static LIST_HEAD(mc_devices);
   
  +/* Maximum size of the location string */
  +#define LOCATION_SIZE 80
  +
  +/* String used to join two or more labels */
  +#define OTHER_LABEL  or 
  +
   /*
* Used to lock EDAC MC to just one module, avoiding two drivers e. g.
* apei/ghes and i7core_edac to be used at the same time.
  @@ -232,6 +238,11 @@ static void _edac_mc_free(struct mem_ctl_info *mci)
  }
  kfree(mci-csrows);
  }
  +
  +   /* Frees the error report string area */
  +   kfree(mci-error_event.location);
  +   kfree(mci-error_event.label);
  +
  kfree(mci);
   }
   
  @@ -445,6 +456,12 @@ struct mem_ctl_info *edac_mc_alloc(unsigned mc_num,
  }
  }
   
  +   /* Allocate memory for the error report */
  +   mci-error_event.location = kmalloc(LOCATION_SIZE, GFP_KERNEL);
  +   mci-error_event.label = kmalloc((EDAC_MC_LABEL_LEN + 1 +
  +sizeof(OTHER_LABEL)) * mci-tot_dimms,
  +GFP_KERNEL);
 
 I see, those are separate strings. Why not embed them into struct
 edac_raw_error_desc? This would simplify the whole buffer handling even
 

Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-16 Thread Borislav Petkov
On Fri, Feb 15, 2013 at 04:20:29PM -0200, Mauro Carvalho Chehab wrote:
> Yeah, pre-allocating a buffer is something that it was on my plans. It
> seems it is time to do it in a clean way. I prefer to keep this as a
> separate patch from 07/13, as it has a different rationale, and mixing
> with 07/13 would just mix two different subjects.
>
> Also, having it separate helps reviewing.

Yep.

> ---
> 
> [PATCH] edac: put all arguments for the raw error handling call into a struct
> 
> The number of arguments for edac_raw_mc_handle_error() is too big;
> put them into a structure and allocate space for it inside
> edac_mc_alloc().
> 
> That reduces a lot the stack usage and simplifies the raw API call.
> 
> Tested with sb_edac driver and MCE error injection. Worked as expected:
> 
> [  143.066100] EDAC MC0: 1 CE memory read error on 
> CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 
> syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> [  143.086424] EDAC MC0: 1 CE memory read error on 
> CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 
> syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> [  143.106570] EDAC MC0: 1 CE memory read error on 
> CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 
> syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> [  143.126712] EDAC MC0: 1 CE memory read error on 
> CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 
> syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> 
> Signed-off-by: Mauro Carvalho Chehab 
> 
> diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> index 9c5da11..9cf33a5 100644
> --- a/drivers/edac/edac_core.h
> +++ b/drivers/edac/edac_core.h
> @@ -454,21 +454,20 @@ extern struct mem_ctl_info *edac_mc_del_mc(struct 
> device *dev);
>  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
> unsigned long page);
>  
> +static inline void edac_raw_error_desc_clean(struct edac_raw_error_desc *e)
> +{
> + int offset = offsetof(struct edac_raw_error_desc, grain);
> +
> + *e->location = '\0';
> + *e->label = '\0';

Why the special handling? Why not memset the whole thing?

> +
> + memset(e + offset, 0, sizeof(*e) - offset);
> +}
> +
> +
>  void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
> -   struct mem_ctl_info *mci,
> -   long grain,
> -   const u16 error_count,
> -   const int top_layer,
> -   const int mid_layer,
> -   const int low_layer,
> -   const unsigned long page_frame_number,
> -   const unsigned long offset_in_page,
> -   const unsigned long syndrome,
> -   const char *msg,
> -   const char *location,
> -   const char *label,
> -   const char *other_detail,
> -   const bool enable_per_layer_report);
> +   struct mem_ctl_info *mci,
> +   struct edac_raw_error_desc *e);
>  
>  void edac_mc_handle_error(const enum hw_event_mc_err_type type,
> struct mem_ctl_info *mci,
> diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
> index 8fddf65..d72853b 100644
> --- a/drivers/edac/edac_mc.c
> +++ b/drivers/edac/edac_mc.c
> @@ -42,6 +42,12 @@
>  static DEFINE_MUTEX(mem_ctls_mutex);
>  static LIST_HEAD(mc_devices);
>  
> +/* Maximum size of the location string */
> +#define LOCATION_SIZE 80
> +
> +/* String used to join two or more labels */
> +#define OTHER_LABEL " or "
> +
>  /*
>   * Used to lock EDAC MC to just one module, avoiding two drivers e. g.
>   *   apei/ghes and i7core_edac to be used at the same time.
> @@ -232,6 +238,11 @@ static void _edac_mc_free(struct mem_ctl_info *mci)
>   }
>   kfree(mci->csrows);
>   }
> +
> + /* Frees the error report string area */
> + kfree(mci->error_event.location);
> + kfree(mci->error_event.label);
> +
>   kfree(mci);
>  }
>  
> @@ -445,6 +456,12 @@ struct mem_ctl_info *edac_mc_alloc(unsigned mc_num,
>   }
>   }
>  
> + /* Allocate memory for the error report */
> + mci->error_event.location = kmalloc(LOCATION_SIZE, GFP_KERNEL);
> + mci->error_event.label = kmalloc((EDAC_MC_LABEL_LEN + 1 +
> +  sizeof(OTHER_LABEL)) * mci->tot_dimms,
> +  GFP_KERNEL);

I see, those are separate strings. Why not embed them into struct
edac_raw_error_desc? This would simplify the whole buffer handling even
more and you won't need to kmalloc them.

Also just FYI, everytime you do kmalloc, you need to handle the case
where it returns 

Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-16 Thread Borislav Petkov
On Fri, Feb 15, 2013 at 04:20:29PM -0200, Mauro Carvalho Chehab wrote:
 Yeah, pre-allocating a buffer is something that it was on my plans. It
 seems it is time to do it in a clean way. I prefer to keep this as a
 separate patch from 07/13, as it has a different rationale, and mixing
 with 07/13 would just mix two different subjects.

 Also, having it separate helps reviewing.

Yep.

 ---
 
 [PATCH] edac: put all arguments for the raw error handling call into a struct
 
 The number of arguments for edac_raw_mc_handle_error() is too big;
 put them into a structure and allocate space for it inside
 edac_mc_alloc().
 
 That reduces a lot the stack usage and simplifies the raw API call.
 
 Tested with sb_edac driver and MCE error injection. Worked as expected:
 
 [  143.066100] EDAC MC0: 1 CE memory read error on 
 CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 
 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
 [  143.086424] EDAC MC0: 1 CE memory read error on 
 CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 
 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
 [  143.106570] EDAC MC0: 1 CE memory read error on 
 CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 
 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
 [  143.126712] EDAC MC0: 1 CE memory read error on 
 CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 
 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
 
 Signed-off-by: Mauro Carvalho Chehab mche...@redhat.com
 
 diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
 index 9c5da11..9cf33a5 100644
 --- a/drivers/edac/edac_core.h
 +++ b/drivers/edac/edac_core.h
 @@ -454,21 +454,20 @@ extern struct mem_ctl_info *edac_mc_del_mc(struct 
 device *dev);
  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 unsigned long page);
  
 +static inline void edac_raw_error_desc_clean(struct edac_raw_error_desc *e)
 +{
 + int offset = offsetof(struct edac_raw_error_desc, grain);
 +
 + *e-location = '\0';
 + *e-label = '\0';

Why the special handling? Why not memset the whole thing?

 +
 + memset(e + offset, 0, sizeof(*e) - offset);
 +}
 +
 +
  void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
 -   struct mem_ctl_info *mci,
 -   long grain,
 -   const u16 error_count,
 -   const int top_layer,
 -   const int mid_layer,
 -   const int low_layer,
 -   const unsigned long page_frame_number,
 -   const unsigned long offset_in_page,
 -   const unsigned long syndrome,
 -   const char *msg,
 -   const char *location,
 -   const char *label,
 -   const char *other_detail,
 -   const bool enable_per_layer_report);
 +   struct mem_ctl_info *mci,
 +   struct edac_raw_error_desc *e);
  
  void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 struct mem_ctl_info *mci,
 diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
 index 8fddf65..d72853b 100644
 --- a/drivers/edac/edac_mc.c
 +++ b/drivers/edac/edac_mc.c
 @@ -42,6 +42,12 @@
  static DEFINE_MUTEX(mem_ctls_mutex);
  static LIST_HEAD(mc_devices);
  
 +/* Maximum size of the location string */
 +#define LOCATION_SIZE 80
 +
 +/* String used to join two or more labels */
 +#define OTHER_LABEL  or 
 +
  /*
   * Used to lock EDAC MC to just one module, avoiding two drivers e. g.
   *   apei/ghes and i7core_edac to be used at the same time.
 @@ -232,6 +238,11 @@ static void _edac_mc_free(struct mem_ctl_info *mci)
   }
   kfree(mci-csrows);
   }
 +
 + /* Frees the error report string area */
 + kfree(mci-error_event.location);
 + kfree(mci-error_event.label);
 +
   kfree(mci);
  }
  
 @@ -445,6 +456,12 @@ struct mem_ctl_info *edac_mc_alloc(unsigned mc_num,
   }
   }
  
 + /* Allocate memory for the error report */
 + mci-error_event.location = kmalloc(LOCATION_SIZE, GFP_KERNEL);
 + mci-error_event.label = kmalloc((EDAC_MC_LABEL_LEN + 1 +
 +  sizeof(OTHER_LABEL)) * mci-tot_dimms,
 +  GFP_KERNEL);

I see, those are separate strings. Why not embed them into struct
edac_raw_error_desc? This would simplify the whole buffer handling even
more and you won't need to kmalloc them.

Also just FYI, everytime you do kmalloc, you need to handle the case
where it returns an error.

[ … ]

 @@ -1174,18 +1162,26 @@ void edac_mc_handle_error(const enum 
 hw_event_mc_err_type 

Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Mauro Carvalho Chehab
Em Fri, 15 Feb 2013 17:02:57 +0100
Borislav Petkov  escreveu:

> On Fri, Feb 15, 2013 at 01:49:29PM -0200, Mauro Carvalho Chehab wrote:
> > Sure, but calling kmalloc while handling a memory error doesn't seem
> > a very good idea, IMHO. So, better to either use an already allocated
> > space (or the stack).
> 
> Either that, or prealloc a buffer on EDAC initialization. You probably
> won't need more than one in 99% of the cases so if you keep it simple
> with a single static buffer for starters, that would probably be the
> cleanest solution.
> 
> > Yes, I know, but, on the other hand, there's the additional cost of
> > copying almost all data into the structure.
> 
> That's very easily paralelizable on out-of-order CPUs (I'd say, all of
> them which need to run EDAC, can do that :-)) so it wouldn't hurt.
> 
> Also, you could allocate the struct in the callers and work directly
> with its members before sending it down to edac_raw_mc_handle_error() -
> that would probably simplify the code a bit more.

Yeah, pre-allocating a buffer is something that it was on my plans.
It seems it is time to do it in a clean way. I prefer to keep this
as a separate patch from 07/13, as it has a different rationale,
and mixing with 07/13 would just mix two different subjects.

Also, having it separate helps reviewing.

---

[PATCH] edac: put all arguments for the raw error handling call into a struct

The number of arguments for edac_raw_mc_handle_error() is too big;
put them into a structure and allocate space for it inside
edac_mc_alloc().

That reduces a lot the stack usage and simplifies the raw API call.

Tested with sb_edac driver and MCE error injection. Worked as expected:

[  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)

Signed-off-by: Mauro Carvalho Chehab 

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..9cf33a5 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -454,21 +454,20 @@ extern struct mem_ctl_info *edac_mc_del_mc(struct device 
*dev);
 extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
  unsigned long page);
 
+static inline void edac_raw_error_desc_clean(struct edac_raw_error_desc *e)
+{
+   int offset = offsetof(struct edac_raw_error_desc, grain);
+
+   *e->location = '\0';
+   *e->label = '\0';
+
+   memset(e + offset, 0, sizeof(*e) - offset);
+}
+
+
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
- struct mem_ctl_info *mci,
- long grain,
- const u16 error_count,
- const int top_layer,
- const int mid_layer,
- const int low_layer,
- const unsigned long page_frame_number,
- const unsigned long offset_in_page,
- const unsigned long syndrome,
- const char *msg,
- const char *location,
- const char *label,
- const char *other_detail,
- const bool enable_per_layer_report);
+ struct mem_ctl_info *mci,
+ struct edac_raw_error_desc *e);
 
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
  struct mem_ctl_info *mci,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8fddf65..d72853b 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -42,6 +42,12 @@
 static DEFINE_MUTEX(mem_ctls_mutex);
 static LIST_HEAD(mc_devices);
 
+/* Maximum size of the location string */
+#define LOCATION_SIZE 80
+
+/* String used to join two or more labels */
+#define OTHER_LABEL " or "
+
 /*
  * Used to lock EDAC MC to just one module, avoiding two drivers e. g.
  * apei/ghes and i7core_edac to be used at the same time.
@@ -232,6 +238,11 @@ static void _edac_mc_free(struct mem_ctl_info *mci)
}
kfree(mci->csrows);
}
+
+   /* Frees the error report string area */
+   kfree(mci->error_event.location);
+   

Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Borislav Petkov
On Fri, Feb 15, 2013 at 01:49:29PM -0200, Mauro Carvalho Chehab wrote:
> Sure, but calling kmalloc while handling a memory error doesn't seem
> a very good idea, IMHO. So, better to either use an already allocated
> space (or the stack).

Either that, or prealloc a buffer on EDAC initialization. You probably
won't need more than one in 99% of the cases so if you keep it simple
with a single static buffer for starters, that would probably be the
cleanest solution.

> Yes, I know, but, on the other hand, there's the additional cost of
> copying almost all data into the structure.

That's very easily paralelizable on out-of-order CPUs (I'd say, all of
them which need to run EDAC, can do that :-)) so it wouldn't hurt.

Also, you could allocate the struct in the callers and work directly
with its members before sending it down to edac_raw_mc_handle_error() -
that would probably simplify the code a bit more.

> Ok, I'll keep this patch on my git. I'll likely fold it with the
> previous one on the final patchset.

Yep.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Mauro Carvalho Chehab
Em Fri, 15 Feb 2013 16:41:23 +0100
Borislav Petkov  escreveu:

> On Fri, Feb 15, 2013 at 01:25:30PM -0200, Mauro Carvalho Chehab wrote:
> > Well, for sure using an structure will help to avoid missing a
> > parameter or exchanging its order. The stack usage won't reduce,
> > though, because the structure will keep using the stack.
> 
> If you allocate it on the stack of the caller, yes. If you kmalloc it,
> no.

Sure, but calling kmalloc while handling a memory error doesn't seem
a very good idea, IMHO. So, better to either use an already allocated
space (or the stack).
> 
> In any case, passing a pointer to struct edac_raw_error_desc only will
> allow on x86_64 (and i386 AFAICT) to use only registers to pass callee
> function arguments. Which is always a win. You probably need to stare at
> compiler output to see what gcc actually does with -O2 optimizations.

Yes, I know, but, on the other hand, there's the additional cost of
copying almost all data into the structure.

> > As I can't foresee the usage of this function call outside the core
> > and by the GHES driver, I'm not sure what would be the better.
> 
> Having an error descriptor is always better, even if it were only for
> clarity's and simplicity's sake.

Yes, the code is now clearer.

Ok, I'll keep this patch on my git. I'll likely fold it with the previous 
one on the final patchset.

-- 

Cheers,
Mauro
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Borislav Petkov
On Fri, Feb 15, 2013 at 01:25:30PM -0200, Mauro Carvalho Chehab wrote:
> Well, for sure using an structure will help to avoid missing a
> parameter or exchanging its order. The stack usage won't reduce,
> though, because the structure will keep using the stack.

If you allocate it on the stack of the caller, yes. If you kmalloc it,
no.

In any case, passing a pointer to struct edac_raw_error_desc only will
allow on x86_64 (and i386 AFAICT) to use only registers to pass callee
function arguments. Which is always a win. You probably need to stare at
compiler output to see what gcc actually does with -O2 optimizations.

> As I can't foresee the usage of this function call outside the core
> and by the GHES driver, I'm not sure what would be the better.

Having an error descriptor is always better, even if it were only for
clarity's and simplicity's sake.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Mauro Carvalho Chehab
Em Fri, 15 Feb 2013 15:13:30 +0100
Borislav Petkov  escreveu:

> On Fri, Feb 15, 2013 at 10:44:55AM -0200, Mauro Carvalho Chehab wrote:
> > That allows APEI GHES driver to report errors directly, using
> > the EDAC error report API.
> > 
> > Signed-off-by: Mauro Carvalho Chehab 
> > ---
> >  drivers/edac/edac_core.h |  17 
> >  drivers/edac/edac_mc.c   | 109 
> > ---
> >  2 files changed, 100 insertions(+), 26 deletions(-)
> > 
> > diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> > index 23bb99f..9c5da11 100644
> > --- a/drivers/edac/edac_core.h
> > +++ b/drivers/edac/edac_core.h
> > @@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct 
> > device *dev);
> >  extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
> >  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
> >   unsigned long page);
> > +
> > +void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
> > + struct mem_ctl_info *mci,
> > + long grain,
> > + const u16 error_count,
> > + const int top_layer,
> > + const int mid_layer,
> > + const int low_layer,
> > + const unsigned long page_frame_number,
> > + const unsigned long offset_in_page,
> > + const unsigned long syndrome,
> > + const char *msg,
> > + const char *location,
> > + const char *label,
> > + const char *other_detail,
> > + const bool enable_per_layer_report);
> 
> The argument count of this one looks like an overkill. Maybe it would be
> nicer, cleaner to do this:
> 
> void __edac_handle_mc_error(const enum hw_event_mc_err_type type,
>   struct mem_ctl_info *mci,
>   struct error_desc *e);
> 
> and struct error_desc collects all the remaining arguments.
> 
> This way you can't get the arguments order wrong, forget one or
> whatever; and it would be much less stack pressure on the function
> calls.

Well, for sure using an structure will help to avoid missing a parameter
or exchanging its order. The stack usage won't reduce, though, because
the structure will keep using the stack. As I can't foresee the usage
of this function call outside the core and by the GHES driver, I'm not
sure what would be the better.

Anyway, I moved it to an structure in the enclosed patch.

Regards,
Mauro

--


edac: put all arguments for the raw error handling call into a struct

The number of arguments for edac_raw_mc_handle_error() is too big;
put them into a structure.

Signed-off-by: Mauro Carvalho Chehab 

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..1574fec 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -381,6 +381,25 @@ struct edac_pci_ctl_info {
struct completion kobj_complete;
 };
 
+/*
+ * Raw error report structure
+ */
+struct edac_raw_error_desc {
+   long grain;
+   u16 error_count;
+   int top_layer;
+   int mid_layer;
+   int low_layer;
+   unsigned long page_frame_number;
+   unsigned long offset_in_page;
+   unsigned long syndrome;
+   const char *msg;
+   const char *location;
+   const char *label;
+   const char *other_detail;
+   bool enable_per_layer_report;
+};
+
 #define to_edac_pci_ctl_work(w) \
container_of(w, struct edac_pci_ctl_info,work)
 
@@ -455,20 +474,8 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info 
*mci,
  unsigned long page);
 
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
- struct mem_ctl_info *mci,
- long grain,
- const u16 error_count,
- const int top_layer,
- const int mid_layer,
- const int low_layer,
- const unsigned long page_frame_number,
- const unsigned long offset_in_page,
- const unsigned long syndrome,
- const char *msg,
- const char *location,
- const char *label,
- const char *other_detail,
- const bool enable_per_layer_report);
+ struct mem_ctl_info *mci,
+ struct edac_raw_error_desc *err);
 
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
  struct mem_ctl_info *mci,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8fddf65..b36f8f8 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1074,70 +1074,43 @@ static 

Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Borislav Petkov
On Fri, Feb 15, 2013 at 10:44:55AM -0200, Mauro Carvalho Chehab wrote:
> That allows APEI GHES driver to report errors directly, using
> the EDAC error report API.
> 
> Signed-off-by: Mauro Carvalho Chehab 
> ---
>  drivers/edac/edac_core.h |  17 
>  drivers/edac/edac_mc.c   | 109 
> ---
>  2 files changed, 100 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> index 23bb99f..9c5da11 100644
> --- a/drivers/edac/edac_core.h
> +++ b/drivers/edac/edac_core.h
> @@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct 
> device *dev);
>  extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
>  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
> unsigned long page);
> +
> +void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
> +   struct mem_ctl_info *mci,
> +   long grain,
> +   const u16 error_count,
> +   const int top_layer,
> +   const int mid_layer,
> +   const int low_layer,
> +   const unsigned long page_frame_number,
> +   const unsigned long offset_in_page,
> +   const unsigned long syndrome,
> +   const char *msg,
> +   const char *location,
> +   const char *label,
> +   const char *other_detail,
> +   const bool enable_per_layer_report);

The argument count of this one looks like an overkill. Maybe it would be
nicer, cleaner to do this:

void __edac_handle_mc_error(const enum hw_event_mc_err_type type,
struct mem_ctl_info *mci,
struct error_desc *e);

and struct error_desc collects all the remaining arguments.

This way you can't get the arguments order wrong, forget one or
whatever; and it would be much less stack pressure on the function
calls.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Mauro Carvalho Chehab
That allows APEI GHES driver to report errors directly, using
the EDAC error report API.

Signed-off-by: Mauro Carvalho Chehab 
---
 drivers/edac/edac_core.h |  17 
 drivers/edac/edac_mc.c   | 109 ---
 2 files changed, 100 insertions(+), 26 deletions(-)

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 23bb99f..9c5da11 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct device 
*dev);
 extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
 extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
  unsigned long page);
+
+void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
+ struct mem_ctl_info *mci,
+ long grain,
+ const u16 error_count,
+ const int top_layer,
+ const int mid_layer,
+ const int low_layer,
+ const unsigned long page_frame_number,
+ const unsigned long offset_in_page,
+ const unsigned long syndrome,
+ const char *msg,
+ const char *location,
+ const char *label,
+ const char *other_detail,
+ const bool enable_per_layer_report);
+
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
  struct mem_ctl_info *mci,
  const u16 error_count,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8e33028..8fddf65 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1069,6 +1069,82 @@ static void edac_ue_error(struct mem_ctl_info *mci,
 #define OTHER_LABEL " or "
 
 /**
+ * edac_raw_mc_handle_error - reports a memory event to userspace without doing
+ *   anything to discover the error location
+ *
+ * @type:  severity of the error (CE/UE/Fatal)
+ * @mci:   a struct mem_ctl_info pointer
+ * @grain: error granularity
+ * @error_count:   Number of errors of the same type
+ * @top_layer: Memory layer[0] position
+ * @mid_layer: Memory layer[1] position
+ * @low_layer: Memory layer[2] position
+ * @page_frame_number: mem page where the error occurred
+ * @offset_in_page:offset of the error inside the page
+ * @syndrome:  ECC syndrome
+ * @msg:   Message meaningful to the end users that
+ * explains the event\
+ * @location:  location of the error, like "csrow:0 channel:1"
+ * @label: DIMM labels for the affected memory(ies)
+ * @other_detail:  Technical details about the event that
+ * may help hardware manufacturers and
+ * EDAC developers to analyse the event
+ * @enable_per_layer_report: should it increment per-layer error counts?
+ *
+ * This raw function is used internally by edac_mc_handle_error(). It should
+ * only be called directly when the hardware error come directly from BIOS,
+ * like in the case of APEI GHES driver.
+ */
+void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
+ struct mem_ctl_info *mci,
+ long grain,
+ const u16 error_count,
+ const int top_layer,
+ const int mid_layer,
+ const int low_layer,
+ const unsigned long page_frame_number,
+ const unsigned long offset_in_page,
+ const unsigned long syndrome,
+ const char *msg,
+ const char *location,
+ const char *label,
+ const char *other_detail,
+ const bool enable_per_layer_report)
+{
+   char detail[80];
+   u8 grain_bits;
+   int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
+
+   /* Report the error via the trace interface */
+   grain_bits = fls_long(grain) + 1;
+   trace_mc_event(type, msg, label, error_count,
+  mci->mc_idx, top_layer, mid_layer, low_layer,
+  PAGES_TO_MiB(page_frame_number) | offset_in_page,
+  grain_bits, syndrome, other_detail);
+
+   /* Memory type dependent details about the error */
+   if (type == HW_EVENT_ERR_CORRECTED) {
+   snprintf(detail, sizeof(detail),
+   "page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
+   page_frame_number, offset_in_page,
+   grain, syndrome);
+   edac_ce_error(mci, error_count, pos, msg, location, 

[PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Mauro Carvalho Chehab
That allows APEI GHES driver to report errors directly, using
the EDAC error report API.

Signed-off-by: Mauro Carvalho Chehab mche...@redhat.com
---
 drivers/edac/edac_core.h |  17 
 drivers/edac/edac_mc.c   | 109 ---
 2 files changed, 100 insertions(+), 26 deletions(-)

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 23bb99f..9c5da11 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct device 
*dev);
 extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
 extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
  unsigned long page);
+
+void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
+ struct mem_ctl_info *mci,
+ long grain,
+ const u16 error_count,
+ const int top_layer,
+ const int mid_layer,
+ const int low_layer,
+ const unsigned long page_frame_number,
+ const unsigned long offset_in_page,
+ const unsigned long syndrome,
+ const char *msg,
+ const char *location,
+ const char *label,
+ const char *other_detail,
+ const bool enable_per_layer_report);
+
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
  struct mem_ctl_info *mci,
  const u16 error_count,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8e33028..8fddf65 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1069,6 +1069,82 @@ static void edac_ue_error(struct mem_ctl_info *mci,
 #define OTHER_LABEL  or 
 
 /**
+ * edac_raw_mc_handle_error - reports a memory event to userspace without doing
+ *   anything to discover the error location
+ *
+ * @type:  severity of the error (CE/UE/Fatal)
+ * @mci:   a struct mem_ctl_info pointer
+ * @grain: error granularity
+ * @error_count:   Number of errors of the same type
+ * @top_layer: Memory layer[0] position
+ * @mid_layer: Memory layer[1] position
+ * @low_layer: Memory layer[2] position
+ * @page_frame_number: mem page where the error occurred
+ * @offset_in_page:offset of the error inside the page
+ * @syndrome:  ECC syndrome
+ * @msg:   Message meaningful to the end users that
+ * explains the event\
+ * @location:  location of the error, like csrow:0 channel:1
+ * @label: DIMM labels for the affected memory(ies)
+ * @other_detail:  Technical details about the event that
+ * may help hardware manufacturers and
+ * EDAC developers to analyse the event
+ * @enable_per_layer_report: should it increment per-layer error counts?
+ *
+ * This raw function is used internally by edac_mc_handle_error(). It should
+ * only be called directly when the hardware error come directly from BIOS,
+ * like in the case of APEI GHES driver.
+ */
+void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
+ struct mem_ctl_info *mci,
+ long grain,
+ const u16 error_count,
+ const int top_layer,
+ const int mid_layer,
+ const int low_layer,
+ const unsigned long page_frame_number,
+ const unsigned long offset_in_page,
+ const unsigned long syndrome,
+ const char *msg,
+ const char *location,
+ const char *label,
+ const char *other_detail,
+ const bool enable_per_layer_report)
+{
+   char detail[80];
+   u8 grain_bits;
+   int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
+
+   /* Report the error via the trace interface */
+   grain_bits = fls_long(grain) + 1;
+   trace_mc_event(type, msg, label, error_count,
+  mci-mc_idx, top_layer, mid_layer, low_layer,
+  PAGES_TO_MiB(page_frame_number) | offset_in_page,
+  grain_bits, syndrome, other_detail);
+
+   /* Memory type dependent details about the error */
+   if (type == HW_EVENT_ERR_CORRECTED) {
+   snprintf(detail, sizeof(detail),
+   page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx,
+   page_frame_number, offset_in_page,
+   grain, syndrome);
+   edac_ce_error(mci, error_count, pos, msg, 

Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Borislav Petkov
On Fri, Feb 15, 2013 at 10:44:55AM -0200, Mauro Carvalho Chehab wrote:
 That allows APEI GHES driver to report errors directly, using
 the EDAC error report API.
 
 Signed-off-by: Mauro Carvalho Chehab mche...@redhat.com
 ---
  drivers/edac/edac_core.h |  17 
  drivers/edac/edac_mc.c   | 109 
 ---
  2 files changed, 100 insertions(+), 26 deletions(-)
 
 diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
 index 23bb99f..9c5da11 100644
 --- a/drivers/edac/edac_core.h
 +++ b/drivers/edac/edac_core.h
 @@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct 
 device *dev);
  extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 unsigned long page);
 +
 +void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
 +   struct mem_ctl_info *mci,
 +   long grain,
 +   const u16 error_count,
 +   const int top_layer,
 +   const int mid_layer,
 +   const int low_layer,
 +   const unsigned long page_frame_number,
 +   const unsigned long offset_in_page,
 +   const unsigned long syndrome,
 +   const char *msg,
 +   const char *location,
 +   const char *label,
 +   const char *other_detail,
 +   const bool enable_per_layer_report);

The argument count of this one looks like an overkill. Maybe it would be
nicer, cleaner to do this:

void __edac_handle_mc_error(const enum hw_event_mc_err_type type,
struct mem_ctl_info *mci,
struct error_desc *e);

and struct error_desc collects all the remaining arguments.

This way you can't get the arguments order wrong, forget one or
whatever; and it would be much less stack pressure on the function
calls.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Mauro Carvalho Chehab
Em Fri, 15 Feb 2013 15:13:30 +0100
Borislav Petkov b...@alien8.de escreveu:

 On Fri, Feb 15, 2013 at 10:44:55AM -0200, Mauro Carvalho Chehab wrote:
  That allows APEI GHES driver to report errors directly, using
  the EDAC error report API.
  
  Signed-off-by: Mauro Carvalho Chehab mche...@redhat.com
  ---
   drivers/edac/edac_core.h |  17 
   drivers/edac/edac_mc.c   | 109 
  ---
   2 files changed, 100 insertions(+), 26 deletions(-)
  
  diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
  index 23bb99f..9c5da11 100644
  --- a/drivers/edac/edac_core.h
  +++ b/drivers/edac/edac_core.h
  @@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct 
  device *dev);
   extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
   extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
unsigned long page);
  +
  +void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
  + struct mem_ctl_info *mci,
  + long grain,
  + const u16 error_count,
  + const int top_layer,
  + const int mid_layer,
  + const int low_layer,
  + const unsigned long page_frame_number,
  + const unsigned long offset_in_page,
  + const unsigned long syndrome,
  + const char *msg,
  + const char *location,
  + const char *label,
  + const char *other_detail,
  + const bool enable_per_layer_report);
 
 The argument count of this one looks like an overkill. Maybe it would be
 nicer, cleaner to do this:
 
 void __edac_handle_mc_error(const enum hw_event_mc_err_type type,
   struct mem_ctl_info *mci,
   struct error_desc *e);
 
 and struct error_desc collects all the remaining arguments.
 
 This way you can't get the arguments order wrong, forget one or
 whatever; and it would be much less stack pressure on the function
 calls.

Well, for sure using an structure will help to avoid missing a parameter
or exchanging its order. The stack usage won't reduce, though, because
the structure will keep using the stack. As I can't foresee the usage
of this function call outside the core and by the GHES driver, I'm not
sure what would be the better.

Anyway, I moved it to an structure in the enclosed patch.

Regards,
Mauro

--


edac: put all arguments for the raw error handling call into a struct

The number of arguments for edac_raw_mc_handle_error() is too big;
put them into a structure.

Signed-off-by: Mauro Carvalho Chehab mche...@redhat.com

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..1574fec 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -381,6 +381,25 @@ struct edac_pci_ctl_info {
struct completion kobj_complete;
 };
 
+/*
+ * Raw error report structure
+ */
+struct edac_raw_error_desc {
+   long grain;
+   u16 error_count;
+   int top_layer;
+   int mid_layer;
+   int low_layer;
+   unsigned long page_frame_number;
+   unsigned long offset_in_page;
+   unsigned long syndrome;
+   const char *msg;
+   const char *location;
+   const char *label;
+   const char *other_detail;
+   bool enable_per_layer_report;
+};
+
 #define to_edac_pci_ctl_work(w) \
container_of(w, struct edac_pci_ctl_info,work)
 
@@ -455,20 +474,8 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info 
*mci,
  unsigned long page);
 
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
- struct mem_ctl_info *mci,
- long grain,
- const u16 error_count,
- const int top_layer,
- const int mid_layer,
- const int low_layer,
- const unsigned long page_frame_number,
- const unsigned long offset_in_page,
- const unsigned long syndrome,
- const char *msg,
- const char *location,
- const char *label,
- const char *other_detail,
- const bool enable_per_layer_report);
+ struct mem_ctl_info *mci,
+ struct edac_raw_error_desc *err);
 
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
  struct mem_ctl_info *mci,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8fddf65..b36f8f8 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1074,70 +1074,43 @@ static void edac_ue_error(struct 

Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Borislav Petkov
On Fri, Feb 15, 2013 at 01:25:30PM -0200, Mauro Carvalho Chehab wrote:
 Well, for sure using an structure will help to avoid missing a
 parameter or exchanging its order. The stack usage won't reduce,
 though, because the structure will keep using the stack.

If you allocate it on the stack of the caller, yes. If you kmalloc it,
no.

In any case, passing a pointer to struct edac_raw_error_desc only will
allow on x86_64 (and i386 AFAICT) to use only registers to pass callee
function arguments. Which is always a win. You probably need to stare at
compiler output to see what gcc actually does with -O2 optimizations.

 As I can't foresee the usage of this function call outside the core
 and by the GHES driver, I'm not sure what would be the better.

Having an error descriptor is always better, even if it were only for
clarity's and simplicity's sake.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Mauro Carvalho Chehab
Em Fri, 15 Feb 2013 16:41:23 +0100
Borislav Petkov b...@alien8.de escreveu:

 On Fri, Feb 15, 2013 at 01:25:30PM -0200, Mauro Carvalho Chehab wrote:
  Well, for sure using an structure will help to avoid missing a
  parameter or exchanging its order. The stack usage won't reduce,
  though, because the structure will keep using the stack.
 
 If you allocate it on the stack of the caller, yes. If you kmalloc it,
 no.

Sure, but calling kmalloc while handling a memory error doesn't seem
a very good idea, IMHO. So, better to either use an already allocated
space (or the stack).
 
 In any case, passing a pointer to struct edac_raw_error_desc only will
 allow on x86_64 (and i386 AFAICT) to use only registers to pass callee
 function arguments. Which is always a win. You probably need to stare at
 compiler output to see what gcc actually does with -O2 optimizations.

Yes, I know, but, on the other hand, there's the additional cost of
copying almost all data into the structure.

  As I can't foresee the usage of this function call outside the core
  and by the GHES driver, I'm not sure what would be the better.
 
 Having an error descriptor is always better, even if it were only for
 clarity's and simplicity's sake.

Yes, the code is now clearer.

Ok, I'll keep this patch on my git. I'll likely fold it with the previous 
one on the final patchset.

-- 

Cheers,
Mauro
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Borislav Petkov
On Fri, Feb 15, 2013 at 01:49:29PM -0200, Mauro Carvalho Chehab wrote:
 Sure, but calling kmalloc while handling a memory error doesn't seem
 a very good idea, IMHO. So, better to either use an already allocated
 space (or the stack).

Either that, or prealloc a buffer on EDAC initialization. You probably
won't need more than one in 99% of the cases so if you keep it simple
with a single static buffer for starters, that would probably be the
cleanest solution.

 Yes, I know, but, on the other hand, there's the additional cost of
 copying almost all data into the structure.

That's very easily paralelizable on out-of-order CPUs (I'd say, all of
them which need to run EDAC, can do that :-)) so it wouldn't hurt.

Also, you could allocate the struct in the callers and work directly
with its members before sending it down to edac_raw_mc_handle_error() -
that would probably simplify the code a bit more.

 Ok, I'll keep this patch on my git. I'll likely fold it with the
 previous one on the final patchset.

Yep.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH EDAC 07/13] edac: add support for raw error reports

2013-02-15 Thread Mauro Carvalho Chehab
Em Fri, 15 Feb 2013 17:02:57 +0100
Borislav Petkov b...@alien8.de escreveu:

 On Fri, Feb 15, 2013 at 01:49:29PM -0200, Mauro Carvalho Chehab wrote:
  Sure, but calling kmalloc while handling a memory error doesn't seem
  a very good idea, IMHO. So, better to either use an already allocated
  space (or the stack).
 
 Either that, or prealloc a buffer on EDAC initialization. You probably
 won't need more than one in 99% of the cases so if you keep it simple
 with a single static buffer for starters, that would probably be the
 cleanest solution.
 
  Yes, I know, but, on the other hand, there's the additional cost of
  copying almost all data into the structure.
 
 That's very easily paralelizable on out-of-order CPUs (I'd say, all of
 them which need to run EDAC, can do that :-)) so it wouldn't hurt.
 
 Also, you could allocate the struct in the callers and work directly
 with its members before sending it down to edac_raw_mc_handle_error() -
 that would probably simplify the code a bit more.

Yeah, pre-allocating a buffer is something that it was on my plans.
It seems it is time to do it in a clean way. I prefer to keep this
as a separate patch from 07/13, as it has a different rationale,
and mixing with 07/13 would just mix two different subjects.

Also, having it separate helps reviewing.

---

[PATCH] edac: put all arguments for the raw error handling call into a struct

The number of arguments for edac_raw_mc_handle_error() is too big;
put them into a structure and allocate space for it inside
edac_mc_alloc().

That reduces a lot the stack usage and simplifies the raw API call.

Tested with sb_edac driver and MCE error injection. Worked as expected:

[  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 
(channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM 
err_code:0001:0090 socket:0 channel_mask:1 rank:0)

Signed-off-by: Mauro Carvalho Chehab mche...@redhat.com

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..9cf33a5 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -454,21 +454,20 @@ extern struct mem_ctl_info *edac_mc_del_mc(struct device 
*dev);
 extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
  unsigned long page);
 
+static inline void edac_raw_error_desc_clean(struct edac_raw_error_desc *e)
+{
+   int offset = offsetof(struct edac_raw_error_desc, grain);
+
+   *e-location = '\0';
+   *e-label = '\0';
+
+   memset(e + offset, 0, sizeof(*e) - offset);
+}
+
+
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
- struct mem_ctl_info *mci,
- long grain,
- const u16 error_count,
- const int top_layer,
- const int mid_layer,
- const int low_layer,
- const unsigned long page_frame_number,
- const unsigned long offset_in_page,
- const unsigned long syndrome,
- const char *msg,
- const char *location,
- const char *label,
- const char *other_detail,
- const bool enable_per_layer_report);
+ struct mem_ctl_info *mci,
+ struct edac_raw_error_desc *e);
 
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
  struct mem_ctl_info *mci,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8fddf65..d72853b 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -42,6 +42,12 @@
 static DEFINE_MUTEX(mem_ctls_mutex);
 static LIST_HEAD(mc_devices);
 
+/* Maximum size of the location string */
+#define LOCATION_SIZE 80
+
+/* String used to join two or more labels */
+#define OTHER_LABEL  or 
+
 /*
  * Used to lock EDAC MC to just one module, avoiding two drivers e. g.
  * apei/ghes and i7core_edac to be used at the same time.
@@ -232,6 +238,11 @@ static void _edac_mc_free(struct mem_ctl_info *mci)
}
kfree(mci-csrows);
}
+
+   /* Frees the error report string area */
+   kfree(mci-error_event.location);
+