Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

banuchka Wed, 19 Apr 2017 07:33:12 -0700

Everything is fine with a noderange. With conserver I regenerated the node list 
with “makeconservercfg -l” from time to time, because that makes it simpler to 
track changes.


Here is the output when I tried to do the same with “makeconfluentcfg -l”:


==> /var/log/confluent/stderr <==
Apr 19 14:22:29   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): Traceback 
(most recent call last):
Apr 19 14:22:29   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):   File 
"/opt/confluent/bin/confluent", line 35, in <module>
Apr 19 14:22:29   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): Traceback (most recent call last):
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in 
fire_timers
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     timer()
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     cb(*args, **kw)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     result = function(*args, **kwargs)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/opt/confluent/lib/python/confluent/shellmodule.py", line 52, in relaydata
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     3600 + (random.random() * 120))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 80, in select
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     listeners.append(hub.add(hub.READ, k, 
on_read, on_error, lambda x: None))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 49, in add
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     listener = BaseHub.add(self, evtype, 
fileno, cb, tb, mac)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 177, in add
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     evtype, fileno, evtype, cb, bucket[fileno]))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): RuntimeError: Second simultaneous read on 
fileno 8 detected.  Unless you really know what you're doing, make sure that 
only one greenthread can read any particular socket.
 Consider using a pools.Pool. If you do know what you're doing and want to 
disable this error, call eventlet.debug.hub_prevent_multiple_readers(False) - 
MY THREAD=<function on_read at 0x7fb45c5f0de8>; THA
T THREAD=FdListener('read', 8, <built-in method switch of GreenThread object at 
0x7fb3fd6efd70>, <built-in method throw of GreenThread object at 
0x7fb3fd6efd70>)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): Traceback (most recent call last):
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in 
fire_timers
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     timer()
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     cb(*args, **kw)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     result = function(*args, **kwargs)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/opt/confluent/lib/python/confluent/shellmodule.py", line 52, in relaydata
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     3600 + (random.random() * 120))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 80, in select
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     listeners.append(hub.add(hub.READ, k, 
on_read, on_error, lambda x: None))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 49, in add
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     listener = BaseHub.add(self, evtype, 
fileno, cb, tb, mac)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 177, in add
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     evtype, fileno, evtype, cb, bucket[fileno]))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): RuntimeError: Second simultaneous read on 
fileno 31 detected.  Unless you really know what you're doing, make sure that 
only one greenthread can read any particular socket.
  Consider using a pools.Pool. If you do know what you're doing and want to 
disable this error, call eventlet.debug.hub_prevent_multiple_readers(False) - 
MY THREAD=<function on_read at 0x7fb3fdfe4aa0>; TH
AT THREAD=FdListener('read', 31, <function on_read at 0x7fb41c771938>, 
<function on_error at 0x7fb3fe56b0c8>)


==> /var/log/confluent/trace <==
Apr 19 14:22:29 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/consoleserver.py", line 220, in 
_connect_backend
    self._console.connect(self.get_console_output)
  File 
"/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line 
237, in connect
    iohandler=self.handle_data)
  File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/console.py", line 62, in 
__init__
    onlogon=self._got_session)
  File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 
413, in __new__
    for res in socket.getaddrinfo(bmc, port, 0, socket.SOCK_DGRAM):
  File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line 
485, in getaddrinfo
    qname, addrs = _getaddrinfo_lookup(host, family, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line 
449, in _getaddrinfo_lookup
    answer = resolve(host, qfamily, False)
  File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line 
396, in resolve
    return resolver.query(name, rdtype, raise_on_no_answer=raises)
  File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line 
356, in query
    raise result[1]
TypeError: <lambda>() takes exactly 1 argument (0 given)

==> /var/log/confluent/stderr <==
Apr 19 14:22:33   File "/usr/lib64/python2.7/atexit.py", line 29, in 
_run_exitfuncs
    print >> sys.stderr, "Error in atexit._run_exitfuncs:": Error in 
atexit._run_exitfuncs:
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): Traceback (most recent call last):
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib64/python2.7/atexit.py", line 
24, in _run_exitfuncs
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     func(*targs, **kargs)
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line 
39, in exithandler
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     console.session.iothread.join()
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 74, in 
join
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     Session._cleanup()
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 322, in 
_cleanup
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     for sesskey in cls.bmc_handlers:
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): RuntimeError: dictionary changed size during 
iteration
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): Error in 
sys.exitfunc:
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): Traceback 
(most recent call last):
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):   File 
"/usr/lib64/python2.7/atexit.py", line 24, in _run_exitfuncs
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): 
func(*targs, **kargs)
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):   File 
"/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line 
39, in exithandler
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): 
console.session.iothread.join()
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):   File 
"/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 74, in 
join
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): 
Session._cleanup()
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):   File 
"/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 322, in 
_cleanup
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): for 
sesskey in cls.bmc_handlers:
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): 
RuntimeError
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): :
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, 
in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): dictionary 
changed size during iteration
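
The last error in the log above, "dictionary changed size during iteration", is what Python raises when a dict gains or loses keys while a for loop is walking it. A minimal stand-alone sketch of that failure mode, and of one common way around it (iterating over a snapshot of the keys), follows. The function names and the sample dict are hypothetical; this is not pyghmi's actual code, nor a confirmed fix for it. The loop mutates the dict itself only to trigger the error deterministically.

```python
def cleanup_unsafe(handlers):
    """Delete entries while iterating the dict directly.

    Raises RuntimeError on the iteration step after the first deletion.
    """
    for key in handlers:
        del handlers[key]


def cleanup_safe(handlers):
    """Iterate over a snapshot of the keys, so the dict may shrink safely."""
    for key in list(handlers):
        # pop() with a default tolerates keys already removed elsewhere
        handlers.pop(key, None)


if __name__ == "__main__":
    # Hypothetical stand-in for a table of per-BMC session handlers
    sessions = {"bmc1": object(), "bmc2": object(), "bmc3": object()}

    try:
        cleanup_unsafe(dict(sessions))
    except RuntimeError as exc:
        print("unsafe:", exc)

    remaining = dict(sessions)
    cleanup_safe(remaining)
    print("safe: %d entries left" % len(remaining))
```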


On 19 April 2017 at 15:01:27, Jarrod Johnson (jjohns...@lenovo.com) wrote:

Anything in the /var/log/confluent neighborhood when that restart  
happens? It doesn't happen if given a noderange?  

-----Original Message-----  
From: banuchka <tyrche...@gmail.com>  
To: xcat-user@lists.sourceforge.net <xcat-user@lists.sourceforge.net>,  
Jarrod Johnson <jjohns...@lenovo.com>  
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
~after 24h.  
Date: Wed, 19 Apr 2017 14:56:12 +0100  

Bad news :)  

On 19 April 2017 at 14:55:45, Jarrod Johnson (jjohns...@lenovo.com)  
wrote:  
> Confluent shouldn’t shut down or even restart…  
>    
> From: banuchka [mailto:tyrche...@gmail.com]   
> Sent: Wednesday, April 19, 2017 9:51 AM  
> To: xcat-user@lists.sourceforge.net; Jarrod Johnson  
> Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
> ~after 24h.  
>    
> And one more thing about Confluent:  
> is it expected behaviour that when I run “makeconfluent” / “makeconfluent  
> -l” (with the confluent service running) to regenerate nodes or add new  
> nodes, confluent shuts down…?  
> For now I have written a wrapper for that procedure (makeconfluent -d for  
> unneeded nodes, makeconfluent nodelist for new nodes).  
>    
> On 19 April 2017 at 14:41:30, Jarrod Johnson (jjohns...@lenovo.com)  
> wrote:  
> Ok, also were those login/logouts always there, or only after that  
> ‘try to suicide every 90 minutes’ experiment?  
>    
> From: banuchka [mailto:tyrche...@gmail.com]   
> Sent: Wednesday, April 19, 2017 9:38 AM  
> To: xcat-user@lists.sourceforge.net; Jarrod Johnson  
> Subject: Re: [xcat-user] Confluent as console server. Consoles hangs  
> ~after 24h.  
>    
> A bit of follow-up:  
> the experiment with nodehealth + echo > /dev/console + rcons didn’t hang the  
> console… maybe it needs more time. I’ll leave it running inside a tmux  
> session for a while longer.  
>    
> On 19 April 2017 at 14:07:58, banuchka (tyrche...@gmail.com) wrote:  
> Thanks Jarrod, I already have a few “plugins” for old Sun servers  
> without SOL, so it isn’t a big problem to create another one.  
> I really appreciate your help.  
> One more thing: I’m trying to fix all the BaudRates on the servers, because  
> as far as I can see the DRAC has at least 3 places with that setting (I’m  
> not sure this is a problem, but it’s not good practice to read and  
> write at different speeds).  
> I’ll try your advice as well and let you know.  
>    
> On 19 April 2017 at 13:59:59, Jarrod Johnson (jjohns...@lenovo.com)  
> wrote:  
> I appreciate all the patience and help, let me know if you had a  
> request about making a shell plugin. The interface is not exactly  
> fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If  
> the approach helps, I can accelerate a syntax for a shell module to  
> request more variables from the configuration (e.g.  
> CONFLUENT_HARDWAREMANAGEMENT_MANAGER SECRET_HARDWARMANAGEMENTUSER,  
> etc).  
>  
> In case you have a question, here's one example:  
> # cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh  
>    
> #!/bin/bash  
> exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE  
>  
>  
> As an aside, would you be able to do one more experiment? Start  
> confluent up, verify console is working, then run nodehealth a few  
> times against the node and see if it triggers the bad state?  
> Especially if you have some cron job that involves some node*  
> commands, imitate that. I was trying to think about things that would be  
> different between ipmitool and pyghmi, and the one thing that occurs  
> to me is that in pyghmi we try to multiplex commands and serial over the  
> same session to limit session consumption. In ipmitool, it's just SOL  
> (apart from an occasional 'get device id' for keepalive), so I'm  
> wondering if some timing or large volume of ipmi commands on a  
> session with active sol session could mess up their BMC SOL session.  
>  
> Unfortunately, I don't have the resources to help chase this since I  
> can't reproduce it on our equipment, so all I can do is guess  
> based on comparative analysis.  
>  
> -----Original Message-----  
> From: banuchka <tyrche...@gmail.com>  
> To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>, Jarrod  
> Johnson <jjohns...@lenovo.com>  
> Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
> ~after 24h.  
> Date: Wed, 19 Apr 2017 11:32:58 +0100  
>  
> Hi,  
>  
> I’m trying to use a plugin for confluent with a simple “ipmitool sol  
> activate” (placed in  
> /opt/confluent/lib/python/confluent/plugins/console/). It is my last  
> attempt to understand what’s going on here.  
> The FW upgrade didn’t help overall.  
> With the current pyghmi setup I see lots of “log on/log off” messages  
> in the BMC’s logs, which doesn’t happen when I’m using ipmitool.  
> I’m out of ideas right now...  
>  
> On 14 April 2017 at 20:59:04, Jarrod Johnson (jjohns...@lenovo.com)  
> wrote:  
> > Yeah, there will be a big push in the coming weeks; it will have at  
> > least an ‘events’ log along with a lot more function.  
> >    
> > Then some more fleshed out documentation (beyond the preliminary  
> > stuff on hpc.lenovo.com).  
> >    
> > Let me know if the firmware exploration works out.  That particular  
> > change line suggests firmware upgrades, but it is possible they could  
> > have some high BMC cpu usage that could manifest in such a way.  The  
> > ‘works with ipmitool’ though has me scratching my head.  
> >    
> > From: banuchka [mailto:tyrche...@gmail.com]   
> > Sent: Friday, April 14, 2017 2:54 PM  
> > To: xCAT Users Mailing list; Jarrod Johnson  
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
> > ~after 24h.  
> >    
> > The last idea doesn’t work for me. The idea itself works  
> > great, though: confluent does disconnect/connect at a constant interval.  
> > But for now it is 100% correct to say that it is a problem with the iDRAC fw.  
> > From the release notes for the latest fw:  
> > ===  
> > - Fix for occasional iDRAC unresponsiveness caused by upgrades via  
> > Firmware RACADM or  
> > have an active SOL or SSH sessions while firmware upgrade is in  
> > progress.  
> > ===  
> > I’m not sure, but maybe it’s something like what I have here. So I did the  
> > upgrade on a few hosts and will give them plenty of time to show me results.  
> > Thanks for your answers, help and time… it is a very interesting quest  
> > :)  
> >    
> > A bit more about Confluent:  
> > - Interesting ambitions   
> > - Python vs. Perl, that’s good  
> > - I think log files (not just trace, stderr, stdout) and  
> > documentation (source on GitHub is the best doc I know, but…) are  
> > things that I would like to see in Confluent  
> >    
> > On 14 April 2017 at 19:27:20, Jarrod Johnson (jjohns...@lenovo.com)  
> > wrote:  
> > Very interested in the outcome.  And thank you for working through  
> > it.  Also interested what you have liked, would like, and have  
> > disliked about confluent.  
> >    
> > From: banuchka [mailto:tyrche...@gmail.com]   
> > Sent: Friday, April 14, 2017 12:01 PM  
> > To: xCAT Users Mailing list; Jarrod Johnson  
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
> > ~after 24h.  
> >    
> > Thank you Jarrod, I’ll try adding the patch and let you know afterwards.  
> > I hope 90 minutes is enough, yes.  
> >    
> > On 14 April 2017 at 16:57:24, Jarrod Johnson (jjohns...@lenovo.com)  
> > wrote:  
> > Hmm, this is going to be very difficult to root cause (I only have  
> > Lenovo equipment as one might expect).  
> >    
> > I’m loath to do a workaround, but in console.py (find /usr -name  
> > console.py), it might be interesting to see how a change like the  
> > following behaves:  
> > diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py  
> > index 95e8551..a5f6062 100644  
> > --- a/pyghmi/ipmi/console.py  
> > +++ b/pyghmi/ipmi/console.py  
> > @@ -42,6 +42,7 @@ class Console(object):  
> >      def __init__(self, bmc, userid, password,  
> >                   iohandler, port=623,  
> >                   force=False, kg=None):  
> > +        self.keepalivecount = 0  
> >          self.keepaliveid = None  
> >          self.connected = False  
> >          self.broken = False  
> > @@ -70,6 +71,7 @@ class Console(object):  
> >          if 'error' in response:  
> >              self._print_error(response['error'])  
> >              return  
> > +        self.keepalivecount = 0  
> >          #Send activate sol payload directive  
> >          #netfn= 6 (application)  
> >          #command = 0x48 (activate payload)  
> > @@ -150,11 +152,12 @@ class Console(object):  
> >              return  
> >          currowner = struct.unpack(  
> >              "<I", struct.pack('4B', *response['data'][:4]))  
> > -        if currowner[0] != self.ipmi_session.sessionid:  
> > +        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:  
> >              # the session is deactivated or active for something else  
> >              self.activated = False  
> >              self._print_error('SOL deactivated')  
> >              return  
> > +        self.keepalivecount += 1  
> >          # ok, still here, that means session is alive, but another  
> >          # common issue is firmware messing with mux on reboot  
> >          # this would be a nice thing to check, but the serial channel  
> >    
> > If it would pan out, should cause the console session to disconnect  
> > itself roughly every 90 minutes and trigger reconnect (is 90 minutes  
> > short enough in your case?)  Would require a service confluent  
> > restart to see if it had the desired effect.  
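
The "roughly every 90 minutes" figure can be checked with a little arithmetic, as a sketch. The patched check only fires once the counter exceeds 180, and the counter is incremented after each passing check, so the reset lands on the 182nd keepalive check. The 30-second keepalive interval below is an assumption (the thread does not state it); it is simply the value that makes about 180 checks come out near 90 minutes.

```python
KEEPALIVE_LIMIT = 180           # threshold from the patch: keepalivecount > 180
ASSUMED_INTERVAL_SECONDS = 30   # assumption, not stated in the thread


def checks_until_reset(limit=KEEPALIVE_LIMIT):
    """Count keepalive checks until the patched condition fires."""
    count = 0    # mirrors self.keepalivecount, reset to 0 on logon
    checks = 0
    while True:
        checks += 1
        if count > limit:   # same comparison as the patched check
            return checks
        count += 1          # mirrors self.keepalivecount += 1


def minutes_until_reset(interval=ASSUMED_INTERVAL_SECONDS,
                        limit=KEEPALIVE_LIMIT):
    """Approximate minutes between forced SOL reconnects."""
    return checks_until_reset(limit) * interval / 60.0


if __name__ == "__main__":
    print(checks_until_reset(), "checks,", minutes_until_reset(), "minutes")
```

With a different real keepalive interval the period scales linearly, so the threshold would need adjusting to keep the reconnect near 90 minutes.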
> >    
> > Sorry I haven’t tested and can’t think of root cause, but going to  
> > take some time off for the weekend.  
> >    
> > I would be curious if the same ipmitool is running a day later than a  
> > check (e.g. if ipmitool is exiting and getting restarted).  I don’t  
> > have the time at the moment to see if they do some other interesting  
> > thing to avoid the behavior.  
> >    
> > From: banuchka [mailto:tyrche...@gmail.com]   
> > Sent: Friday, April 14, 2017 11:45 AM  
> > To: xCAT Users Mailing list; Jarrod Johnson  
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
> > ~after 24h.  
> >    
> > cloud53.ulan:/home/banuchka # ipmitool sol info 1  
> > Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01  
> > Set in progress                 : set-complete  
> > Enabled                         : true  
> > Force Encryption                : true  
> > Force Authentication            : false  
> > Privilege Level                 : ADMINISTRATOR  
> > Character Accumulate Level (ms) : 50  
> > Character Send Threshold        : 255  
> > Retry Count                     : 7  
> > Retry Interval (ms)             : 480  
> > Volatile Bit Rate (kbps)        : 38.4  
> > Non-Volatile Bit Rate (kbps)    : 115.2  
> > Payload Channel                 : 1 (0x01)  
> > Payload Port                    : 623  
> > cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1  
> > cloud53.ulan:/home/banuchka # ipmitool sol info 1  
> > Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01  
> > Set in progress                 : set-complete  
> > Enabled                         : true  
> > Force Encryption                : true  
> > Force Authentication            : false  
> > Privilege Level                 : ADMINISTRATOR  
> > Character Accumulate Level (ms) : 50  
> > Character Send Threshold        : 255  
> > Retry Count                     : 7  
> > Retry Interval (ms)             : 480  
> > Volatile Bit Rate (kbps)        : 115.2  
> > Non-Volatile Bit Rate (kbps)    : 115.2  
> > Payload Channel                 : 1 (0x01)  
> > Payload Port                    : 623  
> > cloud53.ulan:/home/banuchka # echo 123 > /dev/console  
> >    
> > and nothing happened  
> >    
> > in the console’s log  
> > —  
> > [04/14 12:49:12 console disconnected][04/14 12:49:29 console  
> > connected][04/14 13:01:02 console disconnected][04/14 13:01:02  
> > console connected][04/14 13:03:54 console disconnected][04/14  
> > 13:04:15 console connected][04/14 13:38:37 console connected][04/14  
> > 15:31:47 console disconnected][04/14 15:36:24 console  
> > connected][04/14 15:42:08 connection by xcat_console]  
> > ---  
> >    
> > On 14 April 2017 at 16:39:35, Jarrod Johnson (jjohns...@lenovo.com)  
> > wrote:  
> > If you do have any in corrupted state, would be interested to see  
> > what happens if you do:  
> > ipmitool sol set volatile-bit-rate 115.2 1  
> >    
> >    
> > To change the volatile bit rate to match the non-volatile bit rate  
> > and see if the corruption goes away.  
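
To check several BMCs for this kind of mismatch, the text printed by `ipmitool sol info` can be parsed and the two bit-rate fields compared. The following is a minimal sketch: the field labels are taken from the output quoted in this thread, while the function names and the trimmed sample text are hypothetical. It only reports a mismatch and does not establish that a mismatch causes the corruption.

```python
# Trimmed sample in the format shown by `ipmitool sol info 1` in this thread
SAMPLE = """\
Set in progress                 : set-complete
Enabled                         : true
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Port                    : 623
"""


def parse_sol_info(text):
    """Turn 'label : value' lines into a dict; lines without ':' are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def bit_rates_match(text):
    """Return True when volatile and non-volatile bit rates are equal."""
    fields = parse_sol_info(text)
    return (fields["Volatile Bit Rate (kbps)"]
            == fields["Non-Volatile Bit Rate (kbps)"])


if __name__ == "__main__":
    print("bit rates match:", bit_rates_match(SAMPLE))
```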
> >    
> > From: banuchka [mailto:tyrche...@gmail.com]   
> > Sent: Friday, April 14, 2017 11:36 AM  
> > To: xCAT Users Mailing list; Jarrod Johnson  
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
> > ~after 24h.  
> >    
> > 115200  
> >    
> > idracadm7 get iDRAC.IPMISerial  
> > [Key=iDRAC.Embedded.1#IPMISerial.1]  
> > BaudRate=115200  
> > ChanPrivLimit=4  
> > ConnectionMode=Terminal  
> > DeleteControl=Disabled  
> > EchoControl=Enabled  
> > FlowControl=RTS/CTS  
> > HandshakeControl=Enabled  
> > InputNewLineSeq=1  
> > LineEdit=Enabled  
> > NewLineSeq=CR-LF  
> >    
> > that is strange, right  
> >    
> > On 14 April 2017 at 16:31:27, Jarrod Johnson (jjohns...@lenovo.com)  
> > wrote:  
> > Hmm, what’s the baud rate the console is actually running at?  Odd to  
> > see the volatile and non volatile bit rates not be the same.  
> >    
> > From: banuchka [mailto:tyrche...@gmail.com]   
> > Sent: Friday, April 14, 2017 11:28 AM  
> > To: xCAT Users Mailing list; Jarrod Johnson  
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
> > ~after 24h.  
> >    
> >    
> >    
> > On 14 April 2017 at 16:15:16, Jarrod Johnson (jjohns...@lenovo.com)  
> > wrote:  
> > And to be clear, the corruption only starts after a long period of  
> > time of being continuously connected?  
> > Yes, that is correct  
> >  
> >    
> > I might be interested in seeing ipmitool sol info 1 output against a  
> > system while it is working versus showing corrupted info.  
> > corrupted:  
> > # ipmitool -I lanplus -H cloud2manage -U root -a sol info 1  
> > Password:  
> > Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01  
> > Set in progress                 : set-complete  
> > Enabled                         : true  
> > Force Encryption                : true  
> > Force Authentication            : false  
> > Privilege Level                 : ADMINISTRATOR  
> > Character Accumulate Level (ms) : 50  
> > Character Send Threshold        : 255  
> > Retry Count                     : 7  
> > Retry Interval (ms)             : 480  
> > Volatile Bit Rate (kbps)        : 38.4  
> > Non-Volatile Bit Rate (kbps)    : 115.2  
> > Payload Channel                 : 1 (0x01)  
> > Payload Port                    : 623  
> >    
> > Working:  
> > # ipmitool -I lanplus -H cloud2manage -U root -a sol info 1  
> > Password:  
> > Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01  
> > Set in progress                 : set-complete  
> > Enabled                         : true  
> > Force Encryption                : true  
> > Force Authentication            : false  
> > Privilege Level                 : ADMINISTRATOR  
> > Character Accumulate Level (ms) : 50  
> > Character Send Threshold        : 255  
> > Retry Count                     : 7  
> > Retry Interval (ms)             : 480  
> > Volatile Bit Rate (kbps)        : 38.4  
> > Non-Volatile Bit Rate (kbps)    : 115.2  
> > Payload Channel                 : 1 (0x01)  
> > Payload Port                    : 623  
> >  
> >    
> > From: banuchka [mailto:tyrche...@gmail.com]   
> > Sent: Friday, April 14, 2017 11:09 AM  
> > To: xCAT Users Mailing list; Jarrod Johnson  
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
> > ~after 24h.  
> >    
> > Yes, reopening makes it work again, without any garbage… so it looks  
> > like a normal console :)  
> > Hit <enter> causes at first garbage output(�� Por�lo) and *normal  
> > console* before...  
> >    
> > On 14 April 2017 at 16:02:09, Jarrod Johnson (jjohns...@lenovo.com)  
> > wrote:  
> > So reopen causes it to work again, and before, it’s not *hung*, but  
> > erratic with garbage characters and occasional blips of sanity?  
> >    
> > From: banuchka [mailto:tyrche...@gmail.com]   
> > Sent: Friday, April 14, 2017 11:00 AM  
> > To: xCAT Users Mailing list; Jarrod Johnson  
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
> > ~after 24h.  
> >    
> > Reopening the console did the trick as well...  
> >    
> > On 14 April 2017 at 15:54:03, Jarrod Johnson (jjohns...@lenovo.com)  
> > wrote:  
> > ‘ctrl-e, then c, then o’ to reconnect.  
> >    
> > Was conserver ondemand or full logging?  
> >    
> > From: banuchka [mailto:tyrche...@gmail.com]   
> > Sent: Friday, April 14, 2017 10:52 AM  
> > To: xCAT Users Mailing list; Jarrod Johnson  
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs  
> > ~after 24h.  
> >    
> > The console starts showing garbage after <enter> inside rcons.  
> > What do you mean by “restarting the console”?  
> > The console resumes working after:  
> > - <enter> inside rcons/confetty  
> > - bmc reset (console disconnected/console connected)  
> >    
> > You’re absolutely right: with ipmitool and conserver on the same  
> > servers we had no such troubles.  
> > On 14 April 2017 at 15:47:14, Jarrod Johnson (jjohns...@lenovo.com)  
> > wrote:  
> > So the console starts showing garbage?  Restarting the console causes  
> > the garbage to go away?  
> >    
> > You said that ipmitool with a certain configuration did not trigger  
> > this?  
> >    
> > From: banuchka [mailto:tyrche...@gmail.com]   
> > Sent: Friday, April 14, 2017 9:29 AM  
> > To: xCAT Users Mailing list; Jarrod Johnson  
> > Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
> >    
> > I’m out of ideas, let me show you all i see.  
> >    
> > Inside rcons i see:  
> >    
> > MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS  
> > (more complex log below)  
> >    
> > tcpdump(keepalive?):  
> >    
> > 13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> >    
> > …  
> >    
> > 13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> >    
> > Hit <enter> in rcons:  
> > ---  
> > MONITORING_TEST dbb54 1492160401  
> >    
> > ��  
> >   Por�  
> > ---
> >    
> > tcpdump:  
> > 13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> > 13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
> > 13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> > 13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> > 13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> > 13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
> > 13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
> > 13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> >    
> > and Magic, rcons:  
> > ---  
> >   Por�lo]0;console: dbb54 [13:25]  
> >    
> >    
> > dbb54 login:  
> > ---  
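
To compare packet timing in a capture like the one above, something along these lines might help. It is only a rough sketch: it assumes the two-line `tcpdump -v` record layout shown in this thread (timestamp line, then an indented flow line), and the function names `parse_capture` and `gaps_from` are made up for illustration.

```python
import re
from datetime import datetime

# First line of a record, e.g. "13:23:42.342886 IP (tos 0x0, ttl 64, ...)"
TS_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP ")
# Second line of a record, e.g.
# "    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64"
FLOW_RE = re.compile(r"^\s+(\S+) > (\S+): .*UDP, length (\d+)")


def parse_capture(text):
    """Return a list of (timestamp, src, dst, udp_length) tuples."""
    packets = []
    stamp = None
    for line in text.splitlines():
        m = TS_RE.match(line)
        if m:
            # Remember the timestamp until the matching flow line arrives.
            stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            continue
        m = FLOW_RE.match(line)
        if m and stamp is not None:
            packets.append((stamp, m.group(1), m.group(2), int(m.group(3))))
            stamp = None
    return packets


def gaps_from(packets, src):
    """Seconds between consecutive packets sent by `src`."""
    times = [p[0] for p in packets if p[1] == src]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Feeding it the capture text with the quote prefixes stripped and calling `gaps_from(packets, "10.10.114.30.36790")` gives the spacing of the packets sent by the console server, which makes it easier to see whether the traffic stays regular while the console looks stuck. Timestamps carry no date, so a capture that crosses midnight would need extra handling.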
> >    
> > On 14 April 2017 at 12:42:03, Jarrod Johnson (jjohns...@lenovo.com)  
> > wrote:  
> > If you ctrl-e, c, o, does it restore the console after the time?  
> >    
> > Can you tell that it goes after exactly 24hours on the dot?  
> >    
> > When console hung, does ‘ipmitool sol activate’ say ‘session already
> > active’?
> > Yes,   
> > # ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate  
> > Password:  
> > Info: SOL payload already active on another session  
> >  
> >    
> > Does /var/log/confluent/consoles/<nodename> have any interesting  
> > events crop up?  
> > [04/13 15:17:21 console connected]  
> > … many our own messages  
> > ^MMONITORING_TEST dbb54 1492160401 | <== This is the last message  
> > from OS/ # date -d@1492160401 (Fri Apr 14 09:00:01 UTC 2017)  
> > ^M  
> > [04/14 09:05:13 console connected]  
> > [04/14 09:11:59 console connected]  
> > [04/14 09:13:38 console disconnected]  
> > [04/14 09:14:54 console connected]  
> > [04/14 10:15:13 connection by xcat_console]  
> > [04/14 10:15:14 disconnection by xcat_console]  
> > [04/14 13:14:30 connection by xcat_console]  
> >  
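
A log excerpt like the one above could be summarized with a small script. This is a minimal sketch, assuming the bracketed event lines and the `MONITORING_TEST <node> <epoch>` messages look exactly as quoted here; `summarize_console_log` is a hypothetical helper, not something shipped with confluent.

```python
import re

# Bracketed event lines, e.g. "[04/14 09:05:13 console connected]"
EVENT_RE = re.compile(r"^\[(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.+)\]\s*$")
# Heartbeat messages written from the OS, e.g. "MONITORING_TEST dbb54 1492160401"
HEARTBEAT_RE = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def summarize_console_log(lines):
    """Return (last_heartbeat_epoch, events_seen_after_that_heartbeat).

    If the log holds no heartbeat at all, the epoch is None and every
    event in the log is returned.
    """
    last_epoch = None
    events_after = []
    for line in lines:
        hb = HEARTBEAT_RE.search(line)
        if hb:
            # A newer heartbeat resets the list of trailing events.
            last_epoch = int(hb.group(2))
            events_after = []
            continue
        ev = EVENT_RE.match(line.strip())
        if ev:
            events_after.append((ev.group(1), ev.group(2)))
    return last_epoch, events_after
```

Run against the lines of /var/log/confluent/consoles/<nodename>, it returns the epoch of the last message that made it through plus the connect/disconnect events logged since then, which is the part of the log that matters when a console has gone quiet.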
> >    
> > Pyghmi will do keepalive as well, and if that’s the problem, it
> > should be much shorter than 24 hours.  In fact, it should be checking
> > if the SOL payload is active and owned by confluent specifically
> > every couple of minutes.
> > yes, that’s correct
> >  
> >    
> > From: banuchka [mailto:tyrche...@gmail.com]   
> > Sent: Friday, April 14, 2017 5:55 AM  
> > To: xcat-user@lists.sourceforge.net  
> > Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
> >    
> > My last reply was incorrect. The problem is still here. I’m trying to find
> > something useful between confluent and pyghmi...
> > Restarting confluent clears the hangs and reopens all connections.
> > I don’t think restarting confluent once or twice every 24h is the best
> > option.
> >  
> > --   
> > banuchka  
> > On 13 April 2017 at 17:03:19, banuchka (tyrche...@gmail.com) wrote:  
> > It is a Dell-related problem, not 100% sure but…
> > Confluent from current master is doing things well :)
> > Thanks for the pretty nice tool “confluentdbutil”.
> >    
> > On 13 April 2017 at 11:30:14, banuchka (tyrche...@gmail.com) wrote:  
> > Looks like this problem has come up before… The fix was to use ipmitool with
> > keepalive (the one from the xcat repos).
> > Here pyghmi is used, maybe that is the reason?
> >    
> > On 13 April 2017 at 08:22:28, banuchka (tyrche...@gmail.com) wrote:  
> > Hi,   
> >    
> > I’m trying to migrate completely from conserver to confluent, but I’ve
> > run into strange behaviour.
> > Some of my consoles hang after ~24h: no new messages appear in their
> > logs or in rcons.
> > I send timestamped messages from the OS to /dev/console every 30-60min
> > and watch for them for monitoring purposes (console availability
> > monitoring).
> > I can open rcons and hit enter, and after a few seconds the console wakes
> > up (strange). I didn’t see this happen with conserver, or maybe I’m
> > wrong...
> > Some details:
> > - as far as I can see, most of the consoles that hang are Dell iDRAC.
> > It doesn’t matter which type, RacSerial or IPMISerial, is in use.
> > - racreset hard/ipmitool bmc reset didn’t do the trick
> > - hitting enter in the console wakes it up (for example, with expect I can
> > send \r\n\f, but that looks bad)
> > - I didn’t try cleaning confluent's conf and restarting it. Not sure it
> > would help.
> > - HP consoles work well, same IPMI
> > - a few consoles with a custom plugin work fine as well
> >
> > So maybe my question is not about confluent, but if any of you have
> > some knowledge of similar problems, please share it! ;)
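
The monitoring approach described above (a timestamped line written to /dev/console at intervals, then checked on the console server side) might be sketched roughly like this. The message shape follows the `MONITORING_TEST dbb54 1492160401` lines quoted in this thread; the function names and the two-hour threshold are assumptions chosen for illustration, not the actual scripts in use.

```python
import time


def heartbeat_line(node, now=None):
    """Build a line shaped like the messages quoted in this thread."""
    epoch = int(time.time() if now is None else now)
    return "MONITORING_TEST %s %d" % (node, epoch)


def send_heartbeat(node, console_path="/dev/console"):
    """Write one heartbeat line to the OS console (normally needs root)."""
    with open(console_path, "w") as console:
        console.write(heartbeat_line(node) + "\n")


def is_stale(last_epoch, now, max_age=2 * 3600):
    """True if no heartbeat was seen within max_age seconds.

    max_age is an assumed threshold: with messages sent every 30-60
    minutes, two hours allows one missed message before alerting.
    """
    return last_epoch is None or (now - last_epoch) > max_age
```

On the node, `send_heartbeat("dbb54")` would run from cron or similar; on the console server, `is_stale(last_epoch, time.time())` flags a console whose log has stopped receiving those lines.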
> >    
> > --   
> > banuchka  
> > ------------------------------------------------------------------------------
> > Check out the vibrant tech community on one of the world's most
> > engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> > _______________________________________________
> > xCAT-user mailing list
> > xCAT-user@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/xcat-user
>  
> --   
> banuchka  

--   
banuchka

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
