Re: [Freeipmi-devel] ipmi-sensors crash

2013-12-20 Thread Dave Love
Albert Chu ch...@llnl.gov writes:

> On Thu, 2013-12-19 at 11:56 +0000, Dave Love wrote:
>> I realized that the system that caused the crash has 601 sensors; really.
>> Patch attached.
>
> Well how about that :-)

I should have apologized for not looking properly to start with, as the
stack had been smashed.

> Could you try running w/ --bridge-sensors

That's _very_ slow, and still gives N/A.

> or the 'assumebmcowner'
> workaround.  The sensor isn't owned by the BMC, leading to the N/A.

The workaround doesn't make any difference.

I don't know how it's done on the system, and I'm not sure we have
useful technical docs, but the individual BMCs in the component boxes
all report the sensors in all four boxes as far as I can tell.

This is a bit of an annoyance for our monitoring, but not a big deal,
but I can look again in the new year if you would like to address it.

___
Freeipmi-devel mailing list
Freeipmi-devel@gnu.org
https://lists.gnu.org/mailman/listinfo/freeipmi-devel


Re: [Freeipmi-devel] ipmi-sensors crash

2013-12-20 Thread Al Chu
Ping me after the new year and we can look at it more.  I'd like to see
the debug output of a --bridge-sensors with one of those ambient
sensors.

That's the one I expected to work.  Not sure what's going on.  Could be some
subtle bug ... or possibly an internal timeout or something.

Al

On Fri, 2013-12-20 at 14:33 +0000, Dave Love wrote:
> Albert Chu ch...@llnl.gov writes:
>
> > On Thu, 2013-12-19 at 11:56 +0000, Dave Love wrote:
> >> I realized that the system that caused the crash has 601 sensors; really.
> >> Patch attached.
>
> > Well how about that :-)
>
> I should have apologized for not looking properly to start with, as the
> stack had been smashed.
>
> > Could you try running w/ --bridge-sensors
>
> That's _very_ slow, and still gives N/A.
>
> > or the 'assumebmcowner'
> > workaround.  The sensor isn't owned by the BMC, leading to the N/A.
>
> The workaround doesn't make any difference.
>
> I don't know how it's done on the system, and I'm not sure we have
> useful technical docs, but the individual BMCs in the component boxes
> all report the sensors in all four boxes as far as I can tell.
>
> This is a bit of an annoyance for our monitoring, but not a big deal,
> but I can look again in the new year if you would like to address it.
-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory




Re: [Freeipmi-devel] ipmi-sensors crash

2013-12-19 Thread Dave Love
I realized that the system that caused the crash has 601 sensors; really.
Patch attached.

The sensors are still showing N/A, though.  --debug output for one of
them also attached.  The normal output for it is:

ID   | Name          | Type        | Reading | Units | Event
4233 | Ambient Temp. | Temperature | N/A     | C     | N/A

and ipmitool shows this:

ADDR       | ID            | OwnerID | Value | Unit      | Status | LNR    | LC     | LNC    | UNC    | UC     | UNR
           |               |         |       |           | /Mask  | Thres. | Thres. | Thres. | Thres. | Thres. | Thres.
0x008000be | Ambient Temp. | 0x80    | 20.0  | degrees C | ok     | na     | 10.0   | na     | 39.0   | 44.0   | na

(I'm not sure whether they're the same sensor, i.e. how ID and ADDR are
related, but the other three Ambients are the same.)

2013-12-19  Dave Love  f...@gnu.org

	* ipmi-sensors/ipmi-sensors.c (_calculate_record_ids): Check
	record number against array length.

	* common/toolcommon/tool-sensor-common.h (MAX_SENSOR_RECORD_IDS):
	Increase to 1024.

--- freeipmi-1.3.4/common/toolcommon/tool-sensor-common.h.orig	2013-04-26 18:01:55.000000000 +0100
+++ freeipmi-1.3.4/common/toolcommon/tool-sensor-common.h	2013-12-19 11:38:38.061632119 +0000
@@ -55,7 +55,7 @@
 #define MAX_SENSOR_TYPES        256
 #else  /* !0 */
 /* achu: pick more reasonable limits than the theoretical maxes */
-#define MAX_SENSOR_RECORD_IDS   512
+#define MAX_SENSOR_RECORD_IDS   1024
 #define MAX_SENSOR_TYPES        64
 #endif	/* !0 */
 #endif /* !__CYGWIN__ */
--- freeipmi-1.3.4/ipmi-sensors/ipmi-sensors.c.orig	2013-05-08 18:09:34.000000000 +0100
+++ freeipmi-1.3.4/ipmi-sensors/ipmi-sensors.c	2013-12-19 11:38:54.132859006 +0000
@@ -514,6 +514,13 @@
 }
   
   output_record_ids[(*output_record_ids_length)] = record_id;
+	  if ((*output_record_ids_length) >= MAX_SENSOR_RECORD_IDS)
+	    {
+	      fprintf (stderr,
+		       "Too many sensors; limit is %d\n",
+		       MAX_SENSOR_RECORD_IDS - 1);
+	      return (-1);
+	    }
   (*output_record_ids_length)++;
 }
 }
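
The idea behind the bounds check can be shown in isolation.  This is
only a minimal sketch, not FreeIPMI code: append_record_id is a
hypothetical helper, and it assumes the check should happen before the
store so that nothing is ever written past the end of the array.  Only
the limit of 1024 is taken from the patch.

```c
#include <stdio.h>

#define MAX_SENSOR_RECORD_IDS 1024  /* value from the patch; the rest is assumed */

/* Hypothetical helper: append record_id to ids[] if there is room.
 * Returns 0 on success, -1 if the array is already full.
 * The length is tested before the store, so ids[MAX_SENSOR_RECORD_IDS]
 * is never written. */
static int
append_record_id (unsigned int *ids,
                  unsigned int *ids_length,
                  unsigned int record_id)
{
  if ((*ids_length) >= MAX_SENSOR_RECORD_IDS)
    {
      fprintf (stderr,
               "Too many sensors; limit is %d\n",
               MAX_SENSOR_RECORD_IDS);
      return (-1);
    }

  ids[(*ids_length)] = record_id;
  (*ids_length)++;
  return (0);
}
```

With this shape, a system reporting 601 records fits, and a caller that
gets -1 can stop cleanly instead of overwriting whatever follows the
array on the stack.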



ipmi-sensors.debug.gz
Description: debug output


Re: [Freeipmi-devel] ipmi-sensors crash

2013-12-19 Thread Liebig, Holger
 
> I realized that the system that caused the crash has 601 sensors; really.
> Patch attached.
 

[Liebig, Holger] 
Out of curiosity: since the sensor number is limited to 8 bits, are these 601
SDRs grouped with satellite controllers or different LUNs? And does the SDR
model the complete SMP box with all 4 nodes and one controller/BMC per node
(dividing the 600 SDRs into 150 per BMC), or just a single node with this
impressive SDR?

Thanks,
Holger
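
The arithmetic behind the question can be sketched roughly as follows.
It assumes, per the IPMI specification, that a sensor is identified by
its owner ID, owner LUN and an 8-bit sensor number, with FFh reserved.
The struct, the packing layout and the function names are hypothetical
and only serve the illustration.

```c
#include <stdint.h>

/* Hypothetical key: a sensor number is only unique behind one
 * owner ID and LUN, so all three fields are needed. */
struct sensor_key
{
  uint8_t owner_id;       /* slave address or software ID */
  uint8_t owner_lun;      /* 2-bit LUN */
  uint8_t sensor_number;  /* 8 bits, FFh reserved */
};

/* Pack the key into one integer for comparison (layout is arbitrary). */
static uint32_t
sensor_key_pack (struct sensor_key k)
{
  return (((uint32_t) k.owner_id << 16)
          | ((uint32_t) (k.owner_lun & 0x03) << 8)
          | (uint32_t) k.sensor_number);
}

/* Minimum number of distinct owner/LUN pairs needed to give
 * sensor_count sensors unique numbers, at 255 usable numbers each. */
static unsigned int
min_owner_luns (unsigned int sensor_count)
{
  const unsigned int usable_numbers = 255;
  return ((sensor_count + usable_numbers - 1) / usable_numbers);
}
```

For 601 sensors, min_owner_luns (601) gives 3, so the records would have
to be spread over at least three owner/LUN combinations.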



Re: [Freeipmi-devel] ipmi-sensors crash

2013-12-19 Thread Dave Love
Liebig, Holger holger.lie...@ts.fujitsu.com writes:

 
>> I realized that the system that caused the crash has 601 sensors; really.
>> Patch attached.
>>

> [Liebig, Holger]
> Out of curiosity: since the sensor number is limited to 8 bits, are
> these 601 SDRs grouped with satellite controllers or different LUNs?
> And does the SDR model the complete SMP box with all 4 nodes and one
> controller/BMC per node (dividing the 600 SDRs into 150 per BMC), or
> just a single node with this impressive SDR?
>
> Thanks,
> Holger

Sorry, I don't know enough about IPMI to answer, but I can provide data
if someone tells me how.  (Probably not before new year now.)



Re: [Freeipmi-devel] ipmi-sensors crash

2013-12-19 Thread Albert Chu
On Thu, 2013-12-19 at 11:56 +0000, Dave Love wrote:
> I realized that the system that caused the crash has 601 sensors; really.
> Patch attached.

Well how about that :-)

> The sensors are still showing N/A, though.  --debug output for one of
> them also attached.  The normal output for it is:
>
> ID   | Name          | Type        | Reading | Units | Event
> 4233 | Ambient Temp. | Temperature | N/A     | C     | N/A
>
> and ipmitool shows this:
>
> ADDR       | ID            | OwnerID | Value | Unit      | Status | LNR    | LC     | LNC    | UNC    | UC     | UNR
>            |               |         |       |           | /Mask  | Thres. | Thres. | Thres. | Thres. | Thres. | Thres.
> 0x008000be | Ambient Temp. | 0x80    | 20.0  | degrees C | ok     | na     | 10.0   | na     | 39.0   | 44.0   | na
>
> (I'm not sure whether they're the same sensor, i.e. how ID and ADDR are
> related, but the other three Ambients are the same.)

Could you try running w/ --bridge-sensors or the 'assumebmcowner'
workaround.  The sensor isn't owned by the BMC, leading to the N/A.

Al
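
As a rough illustration of what "not owned by the BMC" means here: the
sketch below assumes the OwnerID of 0x80 shown above is the raw sensor
owner ID byte from the SDR, and that the BMC sits at IPMB slave address
20h as in the IPMI specification.  The function names are hypothetical
and this is not how FreeIPMI itself is written.

```c
#include <stdint.h>

#define BMC_SLAVE_ADDRESS 0x20  /* IPMB slave address of the BMC */

/* Bit 0 of the owner ID byte: 0 = IPMB slave address,
 * 1 = system software ID. */
static int
owner_is_system_software (uint8_t owner_id)
{
  return (owner_id & 0x01);
}

/* Hypothetical check: returns 1 if the sensor belongs to the BMC,
 * so a plain (unbridged) reading request can be sent to it, and 0
 * if another controller owns it and the request must be bridged. */
static int
sensor_owned_by_bmc (uint8_t owner_id)
{
  return (!owner_is_system_software (owner_id)
          && owner_id == BMC_SLAVE_ADDRESS);
}
```

Under those assumptions, sensor_owned_by_bmc (0x80) is 0, which matches
the N/A reading unless bridging or the workaround is used.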






Re: [Freeipmi-devel] ipmi-sensors crash

2013-12-18 Thread Al Chu
Hi Dave,

Huh ... I'm sorta at a loss.  The state_data->prog_data data structure
is pretty core.  It's set once near the beginning in main and never
written to again once the main code is executed, which may include
threads if you're doing hostranges.  Dunno if the threading could be
part of the problem for your unique system.

For kicks, another fellow on the mailing list recently had a segv problem and
it appeared it was related to this.  Possible for you too?

http://www.gnu.org/software/freeipmi/freeipmi-faq.html#Why-am-I-seeing-so-many-_0027internal-IPMI-error_0027-or-_0027driver-busy_0027-messages_003f

Al

On Wed, 2013-12-18 at 15:02 +0000, Dave Love wrote:
> I got a segv trying to run ipmi-sensors (1.3.4).  It may be relevant
> that the system is somewhat unusual -- four (Bull) servers glued
> together to make a large SMP box.
>
> I don't have time to debug it properly, but here's a backtrace, although
> it may be junk, given the top of the stack.  I can send specific info
> that might be useful.  The --debug output is 120k compressed, so I
> haven't attached it.
>
> (gdb) bt
> #0  0x004050be in _calculate_record_ids (state_data=0x7fff4e10)
>     at ipmi-sensors.c:485
> #1  _display_sensors (state_data=0x7fff4e10) at ipmi-sensors.c:1162
> #2  0x80048003 in ?? ()
> #3  0x7fff8005 in ?? ()
> #4  0x006371b0 in ?? ()
> #5  0x7fff5100 in ?? ()
> #6  0x in ?? ()
> (gdb) l
> 480                        "ipmi_sdr_parse_record_id_and_type: %s\n",
> 481                        ipmi_sdr_ctx_errormsg (state_data->sdr_ctx));
> 482               return (-1);
> 483             }
> 484
> 485           if (state_data->prog_data->args->exclude_record_ids_length)
> 486             {
> 487               int found_exclude = 0;
> 488
> 489               for (j = 0; j < state_data->prog_data->args->exclude_record_ids_length; j++)
> (gdb) p *state_data->prog_data->args
> Cannot access memory at address 0x7fff800d
> (gdb) p *state_data->prog_data
> Cannot access memory at address 0x7fff8005
> (gdb) p *state_data
> $1 = {prog_data = 0x7fff8005, ipmi_ctx = 0x6371b0,
>   pstate = 0x7fff5100, hostname = 0x0, sdr_ctx = 0x637bd0,
>   sensor_read_ctx = 0x648050, interpret_ctx = 0x0, output_headers = 0,
>   column_width = {record_id = 5, sensor_name = 15, sensor_type = 23,
>     sensor_units = 5}, oem_data = {manufacturer_id = 0, product_id = 0,
>     ipmi_version_major = 0 '\000', ipmi_version_minor = 0 '\000'},
>   intel_node_manager = {node_manager_data_found = 0,
>     nm_health_event_sensor_number = 0 '\000',
>     nm_exception_event_sensor_number = 0 '\000',
>     nm_operational_capabilities_sensor_number = 0 '\000',
>     nm_alert_threshold_exceeded_sensor_number = 0 '\000'}}




Re: [Freeipmi-devel] ipmi-sensors crash

2013-12-18 Thread Albert Chu
Another thought, did you recompile the source?  I'm wondering if maybe
there's a linking issue with wrong library versions or what not.  I've
seen that happen with other users before.

Al

On Wed, 2013-12-18 at 21:40 +0000, Dave Love wrote:
> Al Chu ch...@llnl.gov writes:
>
> > Hi Dave,
>
> > Huh ... I'm sorta at a loss.  The state_data->prog_data data structure
> > is pretty core.  It's set once near the beginning in main and never
> > written to again once the main code is executed,
>
> The stack trace did suggest something had stomped on it.  I'll try to
> find time to have a proper look at it sometime.
>
> > which may include
> > threads if you're doing hostranges.  Dunno if the threading could be
> > part of the problem for your unique system.
>
> This was just the one host, and gdb only showed one thread.
>
> > For kicks, another fellow on the mailing list recently had a segv problem and
> > it appeared it was related to this.  Possible for you too?
>
> > http://www.gnu.org/software/freeipmi/freeipmi-faq.html#Why-am-I-seeing-so-many-_0027internal-IPMI-error_0027-or-_0027driver-busy_0027-messages_003f
>
> I got the same thing out-of-band, which I assume would be independent of
> the kernel.  That's useful information, though, as I tend to turn off
> the ipmi service.


