Everything works when a noderange is given. With conserver, I used to
regenerate the node list with “makeconservercfg -l” at some point, because
that makes it simpler to track changes.
Here is the output when I tried to do the same with “makeconfluentcfg -l”:
==> /var/log/confluent/stderr <==
Apr 19 14:22:29 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): Traceback
(most recent call last):
Apr 19 14:22:29 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): File
"/opt/confluent/bin/confluent", line 35, in <module>
Apr 19 14:22:29 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): Traceback (most recent call last):
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in
fire_timers
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): timer()
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): cb(*args, **kw)
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): result = function(*args, **kwargs)
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/opt/confluent/lib/python/confluent/shellmodule.py", line 52, in relaydata
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): 3600 + (random.random() * 120))
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 80, in select
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): listeners.append(hub.add(hub.READ, k,
on_read, on_error, lambda x: None))
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 49, in add
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): listener = BaseHub.add(self, evtype,
fileno, cb, tb, mac)
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 177, in add
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): evtype, fileno, evtype, cb, bucket[fileno]))
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): RuntimeError: Second simultaneous read on
fileno 8 detected. Unless you really know what you're doing, make sure that
only one greenthread can read any particular socket.
Consider using a pools.Pool. If you do know what you're doing and want to
disable this error, call eventlet.debug.hub_prevent_multiple_readers(False) -
MY THREAD=<function on_read at 0x7fb45c5f0de8>; THAT
THREAD=FdListener('read', 8, <built-in method switch of GreenThread object at
0x7fb3fd6efd70>, <built-in method throw of GreenThread object at
0x7fb3fd6efd70>)
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): Traceback (most recent call last):
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in
fire_timers
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): timer()
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): cb(*args, **kw)
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): result = function(*args, **kwargs)
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/opt/confluent/lib/python/confluent/shellmodule.py", line 52, in relaydata
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): 3600 + (random.random() * 120))
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 80, in select
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): listeners.append(hub.add(hub.READ, k,
on_read, on_error, lambda x: None))
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 49, in add
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): listener = BaseHub.add(self, evtype,
fileno, cb, tb, mac)
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 177, in add
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): evtype, fileno, evtype, cb, bucket[fileno]))
Apr 19 14:22:30 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): RuntimeError: Second simultaneous read on
fileno 31 detected. Unless you really know what you're doing, make sure that
only one greenthread can read any particular socket.
Consider using a pools.Pool. If you do know what you're doing and want to
disable this error, call eventlet.debug.hub_prevent_multiple_readers(False) -
MY THREAD=<function on_read at 0x7fb3fdfe4aa0>; THAT
THREAD=FdListener('read', 31, <function on_read at 0x7fb41c771938>,
<function on_error at 0x7fb3fe56b0c8>)
==> /var/log/confluent/trace <==
Apr 19 14:22:29 Traceback (most recent call last):
File "/opt/confluent/lib/python/confluent/consoleserver.py", line 220, in
_connect_backend
self._console.connect(self.get_console_output)
File
"/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line
237, in connect
iohandler=self.handle_data)
File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/console.py", line 62, in
__init__
onlogon=self._got_session)
File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line
413, in __new__
for res in socket.getaddrinfo(bmc, port, 0, socket.SOCK_DGRAM):
File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line
485, in getaddrinfo
qname, addrs = _getaddrinfo_lookup(host, family, flags)
File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line
449, in _getaddrinfo_lookup
answer = resolve(host, qfamily, False)
File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line
396, in resolve
return resolver.query(name, rdtype, raise_on_no_answer=raises)
File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line
356, in query
raise result[1]
TypeError: <lambda>() takes exactly 1 argument (0 given)
==> /var/log/confluent/stderr <==
Apr 19 14:22:33 File "/usr/lib64/python2.7/atexit.py", line 29, in
_run_exitfuncs
print >> sys.stderr, "Error in atexit._run_exitfuncs:": Error in
atexit._run_exitfuncs:
Apr 19 14:22:33 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): Traceback (most recent call last):
Apr 19 14:22:33 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File "/usr/lib64/python2.7/atexit.py", line
24, in _run_exitfuncs
Apr 19 14:22:33 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): func(*targs, **kargs)
Apr 19 14:22:33 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line
39, in exithandler
Apr 19 14:22:33 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): console.session.iothread.join()
Apr 19 14:22:33 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 74, in
join
Apr 19 14:22:33 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): Session._cleanup()
Apr 19 14:22:33 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): File
"/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 322, in
_cleanup
Apr 19 14:22:33 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): for sesskey in cls.bmc_handlers:
Apr 19 14:22:33 File "/usr/lib64/python2.7/traceback.py", line 13, in _print
file.write(str+terminator): RuntimeError: dictionary changed size during
iteration
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): Error in
sys.exitfunc:
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): Traceback
(most recent call last):
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): File
"/usr/lib64/python2.7/atexit.py", line 24, in _run_exitfuncs
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
func(*targs, **kargs)
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): File
"/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line
39, in exithandler
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
console.session.iothread.join()
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): File
"/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 74, in
join
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Session._cleanup()
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): File
"/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 322, in
_cleanup
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): for
sesskey in cls.bmc_handlers:
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
RuntimeError
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): :
Apr 19 14:22:33 File "/opt/confluent/lib/python/confluent/log.py", line 702,
in write
self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): dictionary
changed size during iteration
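The final RuntimeError in the log above is what Python raises when a dictionary gains or loses keys while a loop is iterating over it. The sketch below reproduces that failure mode in isolation and shows the usual remedy of iterating over a snapshot of the keys. It is an illustration only: `cleanup_unsafe`, `cleanup_safe` and the sample dictionary are hypothetical names, not pyghmi code, and in confluent the change to the dictionary presumably comes from another greenthread rather than from the loop body itself.

```python
def cleanup_unsafe(handlers):
    """Remove entries while iterating over the live dict (this fails)."""
    for key in handlers:
        del handlers[key]


def cleanup_safe(handlers):
    """Iterate over a snapshot of the keys so removals cannot disturb the loop."""
    for key in list(handlers):
        handlers.pop(key, None)


# Hypothetical stand-in for a table of per-BMC session handlers.
sessions = {"bmc1": object(), "bmc2": object()}

try:
    cleanup_unsafe(dict(sessions))
except RuntimeError as exc:
    print("unsafe cleanup failed:", exc)

remaining = dict(sessions)
cleanup_safe(remaining)
print("entries left after safe cleanup:", len(remaining))
```

Taking the snapshot with `list(...)` costs one copy of the key list, which is negligible for a table of console sessions.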
On 19 April 2017 at 15:01:27, Jarrod Johnson (jjohns...@lenovo.com) wrote:
Anything in the /var/log/confluent neighborhood when that restart
happens? It doesn't happen if given a noderange?
-----Original Message-----
From: banuchka <tyrche...@gmail.com>
To: xcat-user@lists.sourceforge.net <xcat-user@lists.sourceforge.net>,
Jarrod Johnson <jjohns...@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
Date: Wed, 19 Apr 2017 14:56:12 +0100
Bad news :)
On 19 April 2017 at 14:55:45, Jarrod Johnson (jjohns...@lenovo.com)
wrote:
> Confluent shouldn’t shut down or even restart…
>
> From: banuchka [mailto:tyrche...@gmail.com]
> Sent: Wednesday, April 19, 2017 9:51 AM
> To: xcat-user@lists.sourceforge.net; Jarrod Johnson
> Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
> ~after 24h.
>
> And one more thing about Confluent:
> is it expected behaviour that when I run “makeconfluent” / “makeconfluent
> -l” (while the confluent service is running) to regenerate nodes or add
> new nodes, confluent shuts down…?
> So for now I wrote a wrapper for that procedure (makeconfluent -d for
> unneeded nodes, makeconfluent nodelist for new nodes).
>
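The wrapper idea described above (delete only the nodes that went away, add only the nodes that are new) can be sketched as a set difference. This is a rough sketch under assumptions, not the wrapper actually used: the function name is hypothetical, the command name defaults to the “makeconfluentcfg” spelling used elsewhere in this thread, and the argument order and comma-separated noderange are assumed. The code only builds the command lines; it does not run them.

```python
def plan_confluent_updates(current_nodes, desired_nodes, tool="makeconfluentcfg"):
    """Return the command lines needed to move from the current to the desired node set.

    `tool` and the placement of `-d` before the noderange are assumptions.
    """
    current, desired = set(current_nodes), set(desired_nodes)
    removed = sorted(current - desired)   # nodes no longer wanted
    added = sorted(desired - current)     # nodes not yet configured
    commands = []
    if removed:
        commands.append([tool, "-d", ",".join(removed)])
    if added:
        commands.append([tool, ",".join(added)])
    return commands


for command in plan_confluent_updates(["n1", "n2"], ["n2", "n3"]):
    print(" ".join(command))
```

Because unchanged nodes are never touched, their console sessions are left alone, which is the point of avoiding a full regeneration.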
> On 19 April 2017 at 14:41:30, Jarrod Johnson (jjohns...@lenovo.com)
> wrote:
> Ok, also were those login/logouts always there, or only after that
> ‘try to suicide every 90 minutes’ experiment?
>
> From: banuchka [mailto:tyrche...@gmail.com]
> Sent: Wednesday, April 19, 2017 9:38 AM
> To: xcat-user@lists.sourceforge.net; Jarrod Johnson
> Subject: Re: [xcat-user] Confluent as console server. Consoles hangs
> ~after 24h.
>
> A bit of follow-up:
> the experiment with nodehealth + echo > /dev/console + rcons didn’t hang
> the console… maybe it needs more time. I’ll leave it running inside a tmux
> session for a while longer.
>
> On 19 April 2017 at 14:07:58, banuchka (tyrche...@gmail.com) wrote:
> Thanks Jarrod, I already have a few “plugins” for old Sun servers
> without SOL, so it isn’t a big problem to create another one.
> I really appreciate your help.
> One more thing: I’m trying to fix all BaudRates on the servers, because
> as far as I can see there are at least 3 places with that setting on the
> DRAC (I’m not sure this is a problem, but it’s not good practice to read
> and write at different speeds).
> I’ll try your advice as well and let you know.
>
> On 19 April 2017 at 13:59:59, Jarrod Johnson (jjohns...@lenovo.com)
> wrote:
> I appreciate all the patience and help, let me know if you had a
> request about making a shell plugin. The interface is not exactly
> fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
> the approach helps, I can accelerate a syntax for a shell module to
> request more variables from the configuration (e.g.
> CONFLUENT_HARDWAREMANAGEMENT_MANAGER SECRET_HARDWARMANAGEMENTUSER,
> etc).
>
> In case you have a question, here's one example:
> # cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
>
> #!/bin/bash
> exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE
>
>
> As an aside, would you be able to do one more experiment? Start
> confluent up, verify console is working, then run nodehealth a few
> times against the node and see if it triggers the bad state?
> Especially if you have some cron job that involves some node*
> commands, imitate that. I was trying to think about things that would
> be different between ipmitool and pyghmi, and the one thing that
> occurs to me is that in pyghmi we try to multiplex commands and serial
> over the same session to limit session consumption. In ipmitool, it's
> just SOL (apart from an occasional 'get device id' for keepalive), so
> I'm wondering if some timing or large volume of ipmi commands on a
> session with active sol session could mess up their BMC SOL session.
>
> Unfortunately, I don't have the resources to help chase this since I
> can't reproduce it on our equipment, so all I can do is guess based
> on comparative analysis.
>
> -----Original Message-----
> From: banuchka <tyrche...@gmail.com>
> To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>, Jarrod
> Johnson <jjohns...@lenovo.com>
> Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
> ~after 24h.
> Date: Wed, 19 Apr 2017 11:32:58 +0100
>
> Hi,
>
> I’m trying a confluent plugin that simply runs “ipmitool sol
> activate” (placed in
> /opt/confluent/lib/python/confluent/plugins/console/). It is my last
> attempt to understand what’s going on here.
> The FW upgrade didn’t help overall.
> With the current pyghmi setup I see lots of “log on/log off” messages
> in the BMC’s logs, which doesn’t happen when I’m using ipmitool.
> I’m out of ideas right now...
>
> On 14 April 2017 at 20:59:04, Jarrod Johnson (jjohns...@lenovo.com)
> wrote:
> > Yeah, there will be a big push in the coming weeks; it will have at
> > least an ‘events’ log along with a lot more function.
> >
> > Then some more fleshed out documentation (beyond the preliminary
> > stuff on hpc.lenovo.com).
> >
> > Let me know if the firmware exploration works out. That particular
> > change line suggests firmware upgrades, but it is possible they
> > could have some high BMC cpu usage that could manifest in such a
> > way. The ‘works with ipmitool’ though has me scratching my head.
> >
> > From: banuchka [mailto:tyrche...@gmail.com]
> > Sent: Friday, April 14, 2017 2:54 PM
> > To: xCAT Users Mailing list; Jarrod Johnson
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
> > ~after 24h.
> >
> > The last idea doesn’t work for me. The patch itself works as
> > intended: confluent does disconnect/connect after a constant time.
> > But for now it seems 100% correct to say that it is a problem with
> > the iDRAC fw.
> > From the release notes for the latest fw:
> > ===
> > - Fix for occasional iDRAC unresponsiveness caused by upgrades via
> > Firmware RACADM or
> > have an active SOL or SSH sessions while firmware upgrade is in
> > progress.
> > ===
> > I’m not sure, but maybe it’s something like what I have here. So I
> > did the upgrade on a few hosts and will give them plenty of time to
> > show results.
> > Thanks for your answers, help and time… it is a very interesting
> > quest :)
> >
> > Bit more about Confluent:
> > - Interesting ambitions
> > - Python VS Perl, that’s good
> > - I think log files (not just trace, stderr, stdout) and
> > documentation (source on GitHub is the best doc I know, but…) are
> > things that I would like to see in Confluent
> >
> > On 14 April 2017 at 19:27:20, Jarrod Johnson (jjohns...@lenovo.com)
> > wrote:
> > Very interested in the outcome. And thank you for working through
> > it. Also interested what you have liked, would like, and have
> > disliked about confluent.
> >
> > From: banuchka [mailto:tyrche...@gmail.com]
> > Sent: Friday, April 14, 2017 12:01 PM
> > To: xCAT Users Mailing list; Jarrod Johnson
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
> > ~after 24h.
> >
> > Thank you Jarrod, I’ll try to add the patch and let you know
> > afterwards. And yes, I hope 90 minutes is enough.
> >
> > On 14 April 2017 at 16:57:24, Jarrod Johnson (jjohns...@lenovo.com)
> > wrote:
> > Hmm, this is going to be very difficult to root cause (I only have
> > Lenovo equipment as one might expect).
> >
> > I’m loath to do a workaround, but in console.py (find /usr -name
> > console.py), it might be interesting to see the effect of a change
> > like the following:
> > diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
> > index 95e8551..a5f6062 100644
> > --- a/pyghmi/ipmi/console.py
> > +++ b/pyghmi/ipmi/console.py
> > @@ -42,6 +42,7 @@ class Console(object):
> >      def __init__(self, bmc, userid, password,
> >                   iohandler, port=623,
> >                   force=False, kg=None):
> > +        self.keepalivecount = 0
> >          self.keepaliveid = None
> >          self.connected = False
> >          self.broken = False
> > @@ -70,6 +71,7 @@ class Console(object):
> >          if 'error' in response:
> >              self._print_error(response['error'])
> >              return
> > +        self.keepalivecount = 0
> >          #Send activate sol payload directive
> >          #netfn= 6 (application)
> >          #command = 0x48 (activate payload)
> > @@ -150,11 +152,12 @@ class Console(object):
> >              return
> >          currowner = struct.unpack(
> >              "<I", struct.pack('4B', *response['data'][:4]))
> > -        if currowner[0] != self.ipmi_session.sessionid:
> > +        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
> >              # the session is deactivated or active for something else
> >              self.activated = False
> >              self._print_error('SOL deactivated')
> >              return
> > +        self.keepalivecount += 1
> >          # ok, still here, that means session is alive, but another
> >          # common issue is firmware messing with mux on reboot
> >          # this would be a nice thing to check, but the serial channel
> >
> > If it would pan out, should cause the console session to disconnect
> > itself roughly every 90 minutes and trigger reconnect (is 90
> > minutes short enough in your case?) Would require a service
> > confluent restart to see if it had the desired effect.
> >
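The counter logic in the patch above can be tried outside pyghmi with a small stand-alone sketch. The 180 threshold comes from the patch; the 30-second keepalive interval is an assumption inferred from "180 checks is roughly 90 minutes" and is not stated anywhere in the thread. The class and function names are hypothetical.

```python
KEEPALIVE_LIMIT = 180           # threshold used in the patch
ASSUMED_INTERVAL_SECONDS = 30   # assumption: inferred from "roughly every 90 minutes"


class KeepaliveCounter:
    """Counts successful keepalive checks and gives up once a limit is passed."""

    def __init__(self, limit=KEEPALIVE_LIMIT):
        self.limit = limit
        self.count = 0

    def check(self, still_owner):
        """Return True to keep the session, False to report it as deactivated."""
        if not still_owner or self.count > self.limit:
            return False
        self.count += 1
        return True


def minutes_until_forced_reconnect(limit=KEEPALIVE_LIMIT,
                                   interval=ASSUMED_INTERVAL_SECONDS):
    """Simulate healthy keepalives until the counter forces a disconnect."""
    counter = KeepaliveCounter(limit)
    checks = 1
    while counter.check(still_owner=True):
        checks += 1
    return checks * interval / 60


print(minutes_until_forced_reconnect())
```

Under that assumed interval the forced disconnect lands just past the 90-minute mark, because the comparison is `>` rather than `>=` and the failing check itself is counted.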
> > Sorry I haven’t tested and can’t think of root cause, but going to
> > take some time off for the weekend.
> >
> > I would be curious if the same ipmitool is running a day later than
> > a check (e.g. if ipmitool is exiting and getting restarted). I
> > don’t have the time at the moment to see if they do some other
> > interesting thing to avoid the behavior.
> >
> > From: banuchka [mailto:tyrche...@gmail.com]
> > Sent: Friday, April 14, 2017 11:45 AM
> > To: xCAT Users Mailing list; Jarrod Johnson
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
> > ~after 24h.
> >
> > cloud53.ulan:/home/banuchka # ipmitool sol info 1
> > Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
> > Set in progress : set-complete
> > Enabled : true
> > Force Encryption : true
> > Force Authentication : false
> > Privilege Level : ADMINISTRATOR
> > Character Accumulate Level (ms) : 50
> > Character Send Threshold : 255
> > Retry Count : 7
> > Retry Interval (ms) : 480
> > Volatile Bit Rate (kbps) : 38.4
> > Non-Volatile Bit Rate (kbps) : 115.2
> > Payload Channel : 1 (0x01)
> > Payload Port : 623
> > cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
> > cloud53.ulan:/home/banuchka # ipmitool sol info 1
> > Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
> > Set in progress : set-complete
> > Enabled : true
> > Force Encryption : true
> > Force Authentication : false
> > Privilege Level : ADMINISTRATOR
> > Character Accumulate Level (ms) : 50
> > Character Send Threshold : 255
> > Retry Count : 7
> > Retry Interval (ms) : 480
> > Volatile Bit Rate (kbps) : 115.2
> > Non-Volatile Bit Rate (kbps) : 115.2
> > Payload Channel : 1 (0x01)
> > Payload Port : 623
> > cloud53.ulan:/home/banuchka # echo 123 > /dev/console
> >
> > and nothing happened
> >
> > in the console’s log
> > ---
> > [04/14 12:49:12 console disconnected][04/14 12:49:29 console
> > connected][04/14 13:01:02 console disconnected][04/14 13:01:02
> > console connected][04/14 13:03:54 console disconnected][04/14
> > 13:04:15 console connected][04/14 13:38:37 console connected][04/14
> > 15:31:47 console disconnected][04/14 15:36:24 console
> > connected][04/14 15:42:08 connection by xcat_console]
> > ---
> >
> > On 14 April 2017 at 16:39:35, Jarrod Johnson (jjohns...@lenovo.com)
> > wrote:
> > If you do have any in corrupted state, would be interested to see
> > what happens if you do:
> > ipmitool sol set volatile-bit-rate 115.2 1
> >
> >
> > To change the volatile bit rate to match the non-volatile bit rate
> > and see if the corruption goes away.
> >
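Spotting a volatile/non-volatile bit rate mismatch across many BMCs by eye is tedious, so the comparison can be scripted. The sketch below parses the `Name : value` lines of `ipmitool sol info` output as quoted in this thread and compares the two rates. The function names are hypothetical, and the sample is an excerpt of the quoted output rather than a fresh capture.

```python
def parse_sol_info(text):
    """Turn the 'Name : value' lines of `ipmitool sol info` output into a dict."""
    info = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            info[name.strip()] = value.strip()
    return info


def bit_rates_match(info):
    """True when the volatile and non-volatile SOL bit rates are equal."""
    return info["Volatile Bit Rate (kbps)"] == info["Non-Volatile Bit Rate (kbps)"]


# Excerpt of the output quoted in this thread.
SAMPLE = """\
Set in progress : set-complete
Enabled : true
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Port : 623
"""

print(bit_rates_match(parse_sol_info(SAMPLE)))
```

The text could come from running ipmitool per node and capturing its standard output; that part is left out here because credentials and host names are site-specific.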
> > From: banuchka [mailto:tyrche...@gmail.com]
> > Sent: Friday, April 14, 2017 11:36 AM
> > To: xCAT Users Mailing list; Jarrod Johnson
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
> > ~after 24h.
> >
> > 115200
> >
> > idracadm7 get iDRAC.IPMISerial
> > [Key=iDRAC.Embedded.1#IPMISerial.1]
> > BaudRate=115200
> > ChanPrivLimit=4
> > ConnectionMode=Terminal
> > DeleteControl=Disabled
> > EchoControl=Enabled
> > FlowControl=RTS/CTS
> > HandshakeControl=Enabled
> > InputNewLineSeq=1
> > LineEdit=Enabled
> > NewLineSeq=CR-LF
> >
> > that is strange, right
> >
> > On 14 April 2017 at 16:31:27, Jarrod Johnson (jjohns...@lenovo.com)
> > wrote:
> > Hmm, what’s the baud rate the console is actually running at? Odd
> > to see the volatile and non volatile bit rates not be the same.
> >
> > From: banuchka [mailto:tyrche...@gmail.com]
> > Sent: Friday, April 14, 2017 11:28 AM
> > To: xCAT Users Mailing list; Jarrod Johnson
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
> > ~after 24h.
> >
> >
> >
> > On 14 April 2017 at 16:15:16, Jarrod Johnson (jjohns...@lenovo.com)
> > wrote:
> > And to be clear, the corruption only starts after a long period of
> > time of being continuously connected?
> > Yes, that is correct
> >
> >
> > I might be interested in seeing ipmitool sol info 1 output against
> > a system while it is working versus showing corrupted info.
> > corrupted:
> > # ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
> > Password:
> > Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
> > Set in progress : set-complete
> > Enabled : true
> > Force Encryption : true
> > Force Authentication : false
> > Privilege Level : ADMINISTRATOR
> > Character Accumulate Level (ms) : 50
> > Character Send Threshold : 255
> > Retry Count : 7
> > Retry Interval (ms) : 480
> > Volatile Bit Rate (kbps) : 38.4
> > Non-Volatile Bit Rate (kbps) : 115.2
> > Payload Channel : 1 (0x01)
> > Payload Port : 623
> >
> > Working:
> > # ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
> > Password:
> > Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
> > Set in progress : set-complete
> > Enabled : true
> > Force Encryption : true
> > Force Authentication : false
> > Privilege Level : ADMINISTRATOR
> > Character Accumulate Level (ms) : 50
> > Character Send Threshold : 255
> > Retry Count : 7
> > Retry Interval (ms) : 480
> > Volatile Bit Rate (kbps) : 38.4
> > Non-Volatile Bit Rate (kbps) : 115.2
> > Payload Channel : 1 (0x01)
> > Payload Port : 623
> >
> >
> > From: banuchka [mailto:tyrche...@gmail.com]
> > Sent: Friday, April 14, 2017 11:09 AM
> > To: xCAT Users Mailing list; Jarrod Johnson
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
> > ~after 24h.
> >
> > Yes, reopening makes it work again, without any garbage… so it looks
> > like a normal console :)
> > Hitting <enter> first produces garbage output (�� Por�lo) and then a
> > *normal console*...
> >
> > On 14 April 2017 at 16:02:09, Jarrod Johnson (jjohns...@lenovo.com)
> > wrote:
> > So reopen causes it to work again, and before, it’s not *hung*, but
> > erratic with garbage characters and occasional blips of sanity?
> >
> > From: banuchka [mailto:tyrche...@gmail.com]
> > Sent: Friday, April 14, 2017 11:00 AM
> > To: xCAT Users Mailing list; Jarrod Johnson
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
> > ~after 24h.
> >
> > Reopening the console did the trick as well...
> >
> > On 14 April 2017 at 15:54:03, Jarrod Johnson (jjohns...@lenovo.com)
> > wrote:
> > ‘ctrl-e, then c, then o’ to reconnect.
> >
> > Was conserver ondemand or full logging?
> >
> > From: banuchka [mailto:tyrche...@gmail.com]
> > Sent: Friday, April 14, 2017 10:52 AM
> > To: xCAT Users Mailing list; Jarrod Johnson
> > Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
> > ~after 24h.
> >
> > The console starts showing garbage after <enter> inside rcons.
> > What do you mean by “restarting console”?
> > The console continues working after:
> > - <enter> inside rcons/confetty
> > - bmc reset (console disconnected/console connected)
> >
> > You’re absolutely right: with ipmitool and conserver on the same
> > servers we had no such troubles.
> > On 14 April 2017 at 15:47:14, Jarrod Johnson (jjohns...@lenovo.com)
> > wrote:
> > So the console starts showing garbage? Restarting the console
> > causes the garbage to go away?
> >
> > You said that ipmitool with a certain configuration did not trigger
> > this?
> >
> > From: banuchka [mailto:tyrche...@gmail.com]
> > Sent: Friday, April 14, 2017 9:29 AM
> > To: xCAT Users Mailing list; Jarrod Johnson
> > Subject: Re: [xcat-user] Confluent as console server. Consoles hangs
> > ~after 24h.
> >
> > I’m out of ideas, so let me show you everything I see.
> >
> > Inside rcons i see:
> >
> > MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS
> > (more complex log below)
> >
> > tcpdump(keepalive?):
> >
> > 13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> >
> > …
> >
> > 13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> >
> > Hit <enter> in rcons:
> > ---
> > MONITORING_TEST dbb54 1492160401
> >
> > ��
> > Por�
> > —
> >
> > tcpdump:
> > 13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> > 13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
> > 13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> > 13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> > 13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> > 13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
> > 13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
> > 13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
> >     10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
> > 13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
> >     10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
> >
> > and Magic, rcons:
> > ---
> > Por�lo]0;console: dbb54 [13:25]
> >
> >
> > dbb54 login:
> > ---
> >
> > On 14 April 2017 at 12:42:03, Jarrod Johnson (jjohns...@lenovo.com)
> > wrote:
> > If you ctrl-e, c, o, does it restore the console after the time?
> >
> > Can you tell that it goes after exactly 24hours on the dot?
> >
> > When console hung, does ‘ipmitool sol activate’ say ‘session already
> > active’?
> > Yes,
> > # ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
> > Password:
> > Info: SOL payload already active on another session
> >
> >
> > Does /var/log/confluent/consoles/<nodename> have any interesting
> > events crop up?
> > [04/13 15:17:21 console connected]
> > … many of our own messages
> > ^MMONITORING_TEST dbb54 1492160401 | <== This is the last message
> > from OS/ # date -d@1492160401 (Fri Apr 14 09:00:01 UTC 2017)
> > ^M
> > [04/14 09:05:13 console connected]
> > [04/14 09:11:59 console connected]
> > [04/14 09:13:38 console disconnected]
> > [04/14 09:14:54 console connected]
> > [04/14 10:15:13 connection by xcat_console]
> > [04/14 10:15:14 disconnection by xcat_console]
> > [04/14 13:14:30 connection by xcat_console]
> >
> >
> > Pyghmi will do keepalive as well, and if that’s the problem, it
> > should be much shorter than 24 hours. In fact, it should be checking
> > if the SOL payload is active and owned by confluent specifically
> > every couple of minutes.
> > Yes, that's correct.
> >
> >
> > From: banuchka [mailto:tyrche...@gmail.com]
> > Sent: Friday, April 14, 2017 5:55 AM
> > To: xcat-user@lists.sourceforge.net
> > Subject: Re: [xcat-user] Confluent as console server. Consoles hangs
> > ~after 24h.
> >
> > My last reply was incorrect. The problem is still here. I'm trying to
> > find something useful between confluent and pyghmi...
> > Restarting confluent clears the hangs and reopens all connections.
> > I don't think restarting confluent once or twice every 24h is the
> > best option.
> >
> > --
> > banuchka
> > On 13 April 2017 at 17:03:19, banuchka (tyrche...@gmail.com) wrote:
> > It looks like a Dell-related problem, though I'm not 100% sure...
> > Confluent from current master is working well :)
> > Thanks for the very nice tool "confluentdbutil".
> >
> > On 13 April 2017 at 11:30:14, banuchka (tyrche...@gmail.com) wrote:
> > It looks like this problem has come up before... The fix was to use
> > ipmitool with keepalive (the one from the xCAT repos).
> > Here pyghmi is used; maybe that is the reason?
> >
> > On 13 April 2017 at 08:22:28, banuchka (tyrche...@gmail.com) wrote:
> > Hi,
> >
> > I'm trying to migrate completely from conserver to confluent, but
> > I've run into strange behaviour.
> > Some of my consoles hang after ~24h: no new messages appear in their
> > logs or in rcons.
> > I send timestamped messages from the OS to /dev/console every 30-60 min
> > and check for them for monitoring purposes (console availability
> > monitoring).
> > If I open rcons and hit enter, the console wakes up after a few
> > seconds (strange). I didn't see this happen with conserver, or maybe
> > I'm wrong...
> > Some details:
> > - as far as I can see, most of the consoles that hang are Dell iDRAC.
> > It doesn't matter whether RacSerial or IPMISerial is in use.
> > - racreset hard / ipmitool bmc reset didn't help
> > - hitting enter in the console wakes it up (for example, with expect
> > I can send \r\n\f, but that looks bad)
> > - I didn't try cleaning confluent's conf and restarting it. Not sure
> > it would help.
> > - HP consoles work well, same IPMI
> > - a few consoles with a custom plugin work well too
> >
> > So maybe my question is not about confluent, but if any of you know
> > about similar problems, please share! ;)
> >
> > --
> > banuchka
> > ------------------------------------------------------------------------------
> > Check out the vibrant tech community on one of the world's most
> > engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> > _______________________________________________
> > xCAT-user mailing list
> > xCAT-user@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/xcat-user
>
> --
> banuchka
--
banuchka
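
For anyone who wants to automate the console availability check described in this thread, here is a minimal sketch. It assumes the heartbeat lines in the console log look like the "MONITORING_TEST dbb54 1492160401" example quoted above (marker, node name, epoch seconds). The function names last_heartbeat and is_stale and the one-hour threshold are illustrative choices of mine, not anything provided by confluent.

```python
import re
from datetime import datetime, timezone

# Matches heartbeat lines such as "MONITORING_TEST dbb54 1492160401".
HEARTBEAT = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(lines):
    """Return (node, epoch) for the last heartbeat line found, or None."""
    found = None
    for line in lines:
        match = HEARTBEAT.search(line)
        if match:
            found = (match.group(1), int(match.group(2)))
    return found


def is_stale(epoch, now, max_age=3600):
    """True if the heartbeat is older than max_age seconds (assumed threshold)."""
    return now - epoch > max_age


# Sample lines modelled on the console log excerpt quoted in this thread.
sample = [
    "[04/13 15:17:21 console connected]",
    "\rMONITORING_TEST dbb54 1492160401",
    "[04/14 09:05:13 console connected]",
]

node, epoch = last_heartbeat(sample)
# Prints: dbb54 2017-04-14T09:00:01+00:00
print(node, datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat())
```

In practice the lines would come from the node's file under /var/log/confluent/consoles/, and "now" from time.time(); a stale result would then flag the console as hung.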