Just to tie everything together in a confusing way, this may be
similar/related to https://github.com/ganglia/monitor-core/issues/47

On 04/16/2013 08:30 AM, Chris Burroughs wrote:
> This sounds a lot like a problem I have been having once a week or so:
> https://github.com/ganglia/monitor-core/issues/97
> 
> I have a reference to 246193:Apr 14 23:59:02 lsu02
> /usr/sbin/gmetad[25897]: Process XML (LAX Tiggr): XML_ParseBuffer()
> error at line 75498: no element found
> 
> in syslog.  But I can't be sure that the timestamp lines up with when the
> interactive port stopped working.  I tried increasing the number of
> server_threads but (anecdotally) that does not appear to have helped.
> xmllint currently says everything is a-okay, but I don't know what the
> output looks like when the interactive port is down.
> 
> On 04/05/2013 01:22 PM, Vladimir Vuksan wrote:
>> Run the XML output through xmllint e.g. something like
>>
>>
>> nc localhost 8651 | xmllint -
>>
>> may give you hints.
>>
>> On Fri, 5 Apr 2013, Ramon Bastiaans wrote:
>>
>>> Ah. I also suspect some weird gmetric is causing this, but so far I have 
>>> not been able to find it in the XML, unfortunately.
>>>
>>> Well regardless of the cause, I think it should not cause the interactive 
>>> port to stop responding and for the web interface to hang.
>>>
>>> Having a quick look at the source of gmetad, I was not able to find where 
>>> this might originate. Perhaps the web interface could fall back to port 
>>> 8651 if port 8652 times out.
>>>
>>> - Ramon
>>>
>>> P.S. pbs-python still alive and well. If you mean "Job Monarch" I have been 
>>> working hard recently on a new release and it is near (99%) finished. ;) 
>>> pbswebmon is a completely different project which SARA is not associated 
>>> with or has any role in.
>>>
>>>
>>> As of January 2013, SARA has a new name: SURFsara.
>>>
>>> ing. Ramon Bastiaans - Senior Systems Programmer - Cluster Computing
>>> | Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG 
>>> Amsterdam | T +31 (0)20 592 30 00 | ramon.bastia...@surfsara.nl | 
>>> www.surfsara.nl |
>>>
>>>
>>>
>>>
>>> On 4 apr. 2013, at 18:52, Chris Hunter <chris.hun...@yale.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have seen this before (ganglia-gmond 3.2) when there are whitespace
>>>> or non-alphanumeric characters in custom gmetrics.
>>>>
>>>> PS I hope pbs-python/pbswebmon are still active...
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> We have been experiencing a weird issue with gmetad.
>>>>>
>>>>> I am running gmetad v3.4.0
>>>>>
>>>>> Once in a while an XML error seems to occur, like this:
>>>>>
>>>>> /usr/sbin/gmetad[12241]: Process XML (LISA Cluster): XML_ParseBuffer() 
>>>>> error at line 525626:
>>>>>
>>>>> Regardless of what is causing that and why, it causes the Ganglia web 
>>>>> front end to hang and become unresponsive.
>>>>>
>>>>> After checking the gmetad it seems port 8652 is no longer responding to 
>>>>> queries. This does nothing:
>>>>>
>>>>> # telnet localhost 8652
>>>>> Trying 127.0.0.1...
>>>>> Connected to localhost.
>>>>> Escape character is '^]'.
>>>>> /LISA Cluster
>>>>>
>>>>> <after about 1 minute>
>>>>> Connection closed by foreign host.
>>>>>
>>>>>
>>>>> However port 8651 still works:
>>>>>
>>>>> # telnet localhost 8651 | wc -l
>>>>> Connection closed by foreign host.
>>>>> 921410
>>>>>
>>>>> And when I switch the web frontend from port 8652 back to port 8651 
>>>>> ($conf['ganglia_port'] = 8651;), the web page responds and works again.
>>>>>
>>>>> After restarting gmetad, port 8652 becomes responsive again. It 
>>>>> almost seems as if a gmetad thread has lost its way or something.
>>>>>
>>>>> Any idea what may be causing this (besides the XML error)? It seems weird 
>>>>> to me that one port works and the other no longer does. It might be a bug.
>>>>>
>>>>> I have a dump of the XML (from port 8651 before restarting) available for 
>>>>> anyone who might want it, but it is 42 MB.
>>>>>
>>>>>
>>>>> Kind regards,
>>>>> - Ramon.
>>>>>
>>>>> As of January 2013, SARA has a new name: SURFsara.
>>>>>
>>>>> ing. Ramon Bastiaans - Senior Systems Programmer - Cluster Computing
>>>>> | Operations, Support & Development | SURFsara | Science Park 140 | 1098 
>>>>> XG Amsterdam | T +31 (0)20 592 30 00 | ramon.bastia...@surfsara.nl | 
>>>>> www.surfsara.nl |
>>>> =
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Minimize network downtime and maximize team effectiveness.
>>>> Reduce network management and security costs.Learn how to hire
>>>> the most talented Cisco Certified professionals. Visit the
>>>> Employer Resources Portal
>>>> http://www.cisco.com/web/learning/employer_resources/index.html
>>>> _______________________________________________
>>>> Ganglia-developers mailing list
>>>> Ganglia-developers@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/ganglia-developers
>>>
>>>
>>
>>
> 
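
For anyone who wants to script the checks discussed in this thread, here is a
minimal Python sketch, not a tested fix. It assumes only what was reported
above: port 8651 dumps the full XML, port 8652 answers a one-line query such
as "/LISA Cluster", and a stuck interactive port either times out or closes
without sending data. The function names (read_gmetad, check_well_formed,
fetch_with_fallback) and the 10 second timeout are made up for illustration;
the real web frontend is PHP and is not shown here.

```python
import socket
import xml.parsers.expat


def read_gmetad(host, port, query=None, timeout=10.0):
    """Return all bytes gmetad sends on one connection.

    Raises OSError (including socket.timeout) if the port does not answer.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        if query is not None:
            # The interactive port expects a one-line query.
            sock.sendall(query.encode() + b"\n")
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def check_well_formed(xml_bytes):
    """Return None if the XML parses, else a (line, message) tuple.

    Uses expat, so the messages resemble the XML_ParseBuffer() ones in syslog.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError as err:
        return err.lineno, xml.parsers.expat.ErrorString(err.code)
    return None


def fetch_with_fallback(host, query, interactive_port=8652, xml_port=8651,
                        timeout=10.0):
    """Try the interactive port first; fall back to the full XML dump."""
    try:
        data = read_gmetad(host, interactive_port, query, timeout)
        if data:
            return interactive_port, data
    except OSError:
        pass  # timed out, refused, or reset: treat as "not responding"
    return xml_port, read_gmetad(host, xml_port, None, timeout)
```

Something like fetch_with_fallback("localhost", "/LISA Cluster") followed by
check_well_formed() on the returned bytes would show which port answered and
whether the document parses. Note that the fallback returns the whole dump,
not just the queried cluster.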


