On Mon, 29 Jul 2013 at 18:18 -0000, Stuart Barkley wrote:

> Hopefully these dumps can help.  I'll look at things more tomorrow.

A couple other notes I should have included earlier...

This occurs when doing rinv, etc on the entire cluster (100 compute
nodes, mostly x3650 M2 with a few 3850 systems).  I have not seen it
occur when only querying a single system or a single rack of systems.
Fewer than about 70 nodes seldom (if ever) causes problems.  100 nodes
often has issues with a few nodes (but not always).  Our 252 compute
node dx360 cluster gives errors more often than our 100 node cluster.

The nodes that fail vary and I haven't noticed any particular set of
nodes that have problems more often than others.

A recent example on the dx360 cluster (mc100 and mc172 are out of
service at this time).

    # date; rinv mc001-mc252,-mc100,-mc172 > /tmp/rinv.tmp
    Tue Jul 30 11:09:26 EDT 2013
    mc057: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc078: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc057: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc063: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc078: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc104: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc063: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc087: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc104: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc057: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc078: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc104: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc087: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc063: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc057: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc078: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc104: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc087: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc063: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc087: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc063: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc104: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
    mc087: Error: timeout
    mc104: Error: timeout
    mc063: Error: timeout
    #
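
To compare which nodes fail from run to run, the error lines can be
tallied per node.  The following is only a minimal sketch, assuming
the error text has been saved or pasted into a string; the function
name tally_errors and the short sample are made up for illustration
and are not part of xCAT.

```python
import re
from collections import Counter, defaultdict

# Matches lines such as "mc087: Error: timeout" (leading spaces allowed).
ERROR_LINE = re.compile(r"^\s*(?P<node>[^:\s]+): Error: (?P<msg>.*)$")


def tally_errors(text):
    """Return {node: Counter({message: count})} for every error line."""
    counts = defaultdict(Counter)
    for line in text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            counts[match.group("node")][match.group("msg").strip()] += 1
    return counts


# Hypothetical sample; in practice this would be the captured rinv errors.
SAMPLE = """\
mc057: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
mc087: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
mc057: Error: Insufficient resources to create new session (wait for existing sessions to timeout)
mc087: Error: timeout
"""

for node, messages in sorted(tally_errors(SAMPLE).items()):
    for message, count in messages.items():
        print(f"{node}: {count} x {message}")
```

Lines that do not look like "node: Error: message" (the shell prompt,
wrapped continuation lines) are simply skipped.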

We are still running CentOS 5.8 on the xCAT server.  I hope to upgrade
to 6.4 in the next couple of days.

IMM and UEFI on these systems are a little dated, but I don't see
anything relevant in the latest release notes.

    BMC Firmware: 1.34 (YUOOE3E 2012/08/22 03:39:38)
    UEFI Version: 1.15 (D6E157A 2012/06/13)

Compute nodes are all running CentOS 6.4.

Some of the compute nodes are powered off and others are powered on.
We have code that turns off idle nodes and powers them back on as
needed.  During these tests, no nodes were changing power states.

Our code uses ipmitool to get power state and power consumption
information.  I ran some of these tests with that code disabled and
the problems still occurred.  xCAT should have been the only thing
doing any IPMI traffic.

The man page for the site table shows:

    ipmimaxp:  The max # of processes for ipmi hw ctrl. The default is
               64. Currently, this is only used for HP hw control.

    ipmiretries:  The # of retries to use when communicating with
                  BMCs. Default is 3.

    ipmitimeout:  The timeout to use when communicating with BMCs.
                  Default is 2.  This attribute is currently not used.

Is ipmimaxp really not used for IBM hardware?  This might be useful to
limit the number of outstanding requests.
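
Until that is answered, one possible stopgap is to query the cluster
in batches smaller than the roughly 70 nodes at which problems start
to appear.  The following is a rough sketch only, not something I have
verified fixes the issue: the batch size of 64 and the helper names
node_names and batches are my own assumptions, and it only prints the
comma-separated node lists that would be handed to rinv one at a time.

```python
def node_names(prefix, first, last, exclude=(), width=3):
    """Build zero-padded node names, e.g. mc001..mc252, minus exclusions."""
    skip = set(exclude)
    names = (f"{prefix}{i:0{width}d}" for i in range(first, last + 1))
    return [name for name in names if name not in skip]


def batches(nodes, size):
    """Yield consecutive slices of at most `size` nodes."""
    for start in range(0, len(nodes), size):
        yield nodes[start:start + size]


# mc100 and mc172 are out of service, as in the rinv run above.
NODES = node_names("mc", 1, 252, exclude=("mc100", "mc172"))

# Assumed batch size: below the ~70 node point where errors begin.
BATCH_SIZE = 64

for group in batches(NODES, BATCH_SIZE):
    # Each printed line is one noderange to run sequentially.
    print("rinv " + ",".join(group))
```

Running the batches one after another keeps the number of sessions
opened at once bounded, at the cost of a longer total run time.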

Stuart Barkley
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone
