Just to tie everything together in a confusing way, this may be similar/related to https://github.com/ganglia/monitor-core/issues/47
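In case it helps anyone catch the hang while it is happening, here is a minimal Python sketch of the checks discussed in the thread below. It reads a gmetad port with a timeout, so a stuck interactive port shows up as a failed read rather than a hung telnet, and it runs whatever comes back through expat to get the position of the first XML error. The function names, the `/` request and the 10-second timeout are my own assumptions, not anything taken from gmetad or the web frontend. Ports 8651 and 8652 are the ones mentioned in the quoted messages.

```python
import socket
import xml.parsers.expat


def query_gmetad(host, port, request=None, timeout=10.0):
    """Read from a gmetad port until the peer closes the connection.

    Returns the raw bytes, or None if the connection is refused/reset or
    any single connect/send/recv stalls for longer than `timeout` seconds.
    `request` is an optional query line for the interactive port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if request is not None:
                sock.sendall(request.encode() + b"\n")
            chunks = []
            while True:
                data = sock.recv(65536)
                if not data:  # peer closed: the dump is complete
                    break
                chunks.append(data)
            return b"".join(chunks)
    except OSError:  # includes timeouts and refused connections
        return None


def first_xml_error(payload):
    """Return None if `payload` is well-formed XML, else (line, column, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(payload, True)  # True: this is the final chunk
    except xml.parsers.expat.ExpatError as err:
        return err.lineno, err.offset, xml.parsers.expat.ErrorString(err.code)
    return None


def report(host="localhost", queries=None):
    """Print one status line per port (hypothetical helper)."""
    # Assumption: 8651 dumps everything unprompted, 8652 wants a query line.
    if queries is None:
        queries = {8651: None, 8652: "/"}
    for port, request in queries.items():
        payload = query_gmetad(host, port, request)
        if payload is None:
            print(f"port {port}: no answer (refused, reset, or timed out)")
            continue
        err = first_xml_error(payload)
        if err is None:
            status = "well-formed"
        else:
            status = "XML error at line %d, column %d: %s" % err
        print(f"port {port}: {len(payload)} bytes, {status}")
```

Running `report()` from cron would at least timestamp when 8652 stops answering, which could then be compared against the XML_ParseBuffer() lines in syslog. Python's `xml.parsers.expat` wraps the same expat library that provides XML_ParseBuffer(), so the error text should look familiar.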

On 04/16/2013 08:30 AM, Chris Burroughs wrote:
> This sounds a lot like a problem I have been having once a week or so:
> https://github.com/ganglia/monitor-core/issues/97
>
> I have a reference to 246193:Apr 14 23:59:02 lsu02
> /usr/sbin/gmetad[25897]: Process XML (LAX Tiggr): XML_ParseBuffer()
> error at line 75498: no element found
>
> in syslog. But I can't be sure that timestamp lines up with when the
> interactive port stopped working. I tried increasing the number of
> server_threads but (anecdotally) that does not appear to have helped.
> xmllint currently says everything is a-okay but I don't know what it
> looks like when the interactive port is down.
>
> On 04/05/2013 01:22 PM, Vladimir Vuksan wrote:
>> Run the XML output through xmllint e.g. something like
>>
>>
>> nc localhost 8651 | xmllint -
>>
>> may give you hints.
>>
>> On Fri, 5 Apr 2013, Ramon Bastiaans wrote:
>>
>>> Ah. I also suspect some weird gmetric to cause this, but so far have not
>>> been able to find it in the XML unfortunately.
>>>
>>> Well regardless of the cause, I think it should not cause the interactive
>>> port to stop responding and the web interface to hang.
>>>
>>> Having a quick look at the source of gmetad I was not able to find where
>>> this might originate. Perhaps the web interface could fall back to port
>>> 8651 if port 8652 times out.
>>>
>>> - Ramon
>>>
>>> P.S. pbs-python is still alive and well. If you mean "Job Monarch" I have been
>>> working hard recently on a new release and it is near (99%) finished. ;)
>>> pbswebmon is a completely different project which SARA is not associated
>>> with or has any role in.
>>>
>>>
>>> As of January 2013, SARA has a new name: SURFsara.
>>>
>>> ing. Ramon Bastiaans - Senior Systems Programmer - Cluster Computing
>>> | Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG
>>> Amsterdam | T +31 (0)20 592 30 00 | ramon.bastia...@surfsara.nl |
>>> www.surfsara.nl |
>>>
>>>
>>> On 4 apr. 2013, at 18:52, Chris Hunter <chris.hun...@yale.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have seen this before (ganglia-gmond 3.2) when there are whitespace
>>>> or non-alphanumeric characters in custom gmetrics.
>>>>
>>>> PS I hope pbs-python/pbswebmon are still active...
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> We have been experiencing a weird issue with gmetad.
>>>>>
>>>>> I am running gmetad v3.4.0
>>>>>
>>>>> Once in a while now an XML error seems to occur. Like this:
>>>>>
>>>>> /usr/sbin/gmetad[12241]: Process XML (LISA Cluster): XML_ParseBuffer()
>>>>> error at line 525626:
>>>>>
>>>>> Besides what is causing that and why, this is causing the Ganglia web front
>>>>> end to hang and become unresponsive.
>>>>>
>>>>> After checking gmetad it seems port 8652 is no longer responding to
>>>>> queries. This does nothing:
>>>>>
>>>>> # telnet localhost 8652
>>>>> Trying 127.0.0.1...
>>>>> Connected to localhost.
>>>>> Escape character is '^]'.
>>>>> /LISA Cluster
>>>>>
>>>>> <after about 1 minute>
>>>>> Connection closed by foreign host.
>>>>>
>>>>>
>>>>> However port 8651 still works:
>>>>>
>>>>> # telnet localhost 8651 | wc -l
>>>>> Connection closed by foreign host.
>>>>> 921410
>>>>>
>>>>> And when I switch the web frontend from port 8652 back to port 8651
>>>>> ($conf['ganglia_port'] = 8651;), the web page responds and works again.
>>>>>
>>>>> After restarting gmetad port 8652 also becomes responsive again. It
>>>>> almost seems gmetad has a thread that lost its way or something.
>>>>>
>>>>> Any idea what may be causing this (besides the XML error)? It seems weird
>>>>> to me if 1 port works and the other does not anymore. It might be a bug.
>>>>>
>>>>> I have a dump of the XML (from port 8651 before restarting) available for
>>>>> whoever might want it, but it is 42 MB.
>>>>>
>>>>>
>>>>> Kind regards,
>>>>> - Ramon.

_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers