We have been slowly trying to squash stability issues with gmetad.
Many of the symptoms seem to relate to sockets ending up in
CLOSE_WAIT, although I'm unsure whether that is a useful clue or a side
effect.  The user-visible problem is that metrics stop updating and TN
climbs for an entire cluster (or several clusters).  We had some
success with patches and UDP buffer tuning in [1], but the problem
has not entirely gone away, and we have been posting occasional updates
at [2].
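
For anyone who wants to watch for the CLOSE_WAIT symptom without
running lsof by hand, here is a minimal sketch in Python.  It is not
something we ran in production: the function names are made up for
illustration, it only understands the IPv4 table in /proc/net/tcp
(not /proc/net/tcp6), and it assumes a little-endian host such as
x86_64 when decoding addresses.  It relies on the kernel reporting
CLOSE_WAIT as state 08 in that file.

```python
import socket
import struct

CLOSE_WAIT = "08"  # TCP state code for CLOSE_WAIT in /proc/net/tcp


def _decode(addr):
    """Turn 'HEXIP:HEXPORT' into ('a.b.c.d', port).

    Assumption: little-endian host, so the hex IP is byte-swapped.
    """
    ip_hex, port_hex = addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def close_wait_sockets(proc_net_tcp_text):
    """Return (local, remote) address pairs for sockets in CLOSE_WAIT."""
    found = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # fields: sl, local_address, rem_address, st, ...
        if len(fields) < 4 or fields[3] != CLOSE_WAIT:
            continue
        found.append((_decode(fields[1]), _decode(fields[2])))
    return found
```

Feeding it the contents of /proc/net/tcp on the gmetad host and
filtering on the gmond port of the affected cluster should show
whether a stuck connection lines up with the TN climb.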

We run a pair of gmetads (3.6.0) polling the same gmonds and used that
setup for some additional experiments focusing on the user-perceived
problem.  In short, only one of them experienced elevated TN (i.e. TN
for load_one > 300); the other did not, which suggests the problem is
with gmetad and not gmond.  We also captured lsof/pstack output and
noticed that:
  * A connection to the relevant cluster was stuck in CLOSE_WAIT
  * Several threads appeared to be waiting on a pthreads construct [3]
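
To count how many threads are in that state without reading each
pstack dump by eye, something along these lines might help.  This is a
rough Python sketch, not a tool we actually used: the function names
are hypothetical, and it assumes raw pstack output where each frame
sits on one line in the "#N  0xADDR in func () from lib" form (the
trace in [3] got wrapped by the mail client).

```python
import re
from collections import defaultdict

# "Thread 7 (Thread 0x7f1e2d7fb700 (LWP 3329)):"
THREAD_RE = re.compile(r"^Thread (\d+) \(")
# "#2  0x00000031eb20b7be in hash_insert () from ..."
FRAME_RE = re.compile(r"^#\d+\s+0x[0-9a-fA-F]+ in (\S+) \(")


def threads_by_function(pstack_text):
    """Map thread number -> list of function names, innermost frame first."""
    stacks = defaultdict(list)
    current = None
    for line in pstack_text.splitlines():
        m = THREAD_RE.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = FRAME_RE.match(line)
        if m and current is not None:
            stacks[current].append(m.group(1))
    return dict(stacks)


def threads_blocked_in(pstack_text, function="hash_insert"):
    """Return sorted thread numbers whose stack contains `function`."""
    stacks = threads_by_function(pstack_text)
    return sorted(t for t, frames in stacks.items() if function in frames)
```

Running threads_blocked_in() over successive dumps taken a minute or
so apart would show whether the same threads stay parked in
hash_insert for the whole 10-15 minute window.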

While googling we found a very similar issue from 2009 that specifically
suggested that hash_insert might be the problem (we had been focusing on
libexpat internally).  Note that unlike [2] and [4], this problem lasts
10-15 minutes instead of indefinitely.  Given the similarity of
symptoms and code, I suspect they are related.

Is anyone still experiencing similar problems?  If so, any ideas?


[1] http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg07715.html

[2] https://github.com/ganglia/monitor-core/issues/47

[3]

Thread 7 (Thread 0x7f1e2d7fb700 (LWP 3329)):
#0  0x0000003616e0b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000031eb20c38b in pthread_rdwr_wlock_np () from /usr/lib64/libganglia-3.6.0.so.0
#2  0x00000031eb20b7be in hash_insert () from /usr/lib64/libganglia-3.6.0.so.0
#3  0x0000000000405c78 in end ()
#4  0x00000037f520a538 in ?? () from /lib64/libexpat.so.1
#5  0x00000037f520b8ce in ?? () from /lib64/libexpat.so.1
#6  0x00000037f520d1fa in ?? () from /lib64/libexpat.so.1
#7  0x00000037f520db2b in ?? () from /lib64/libexpat.so.1
#8  0x00000037f5203f82 in XML_ParseBuffer () from /lib64/libexpat.so.1
#9  0x0000000000405b1b in process_xml ()
#10 0x0000000000404603 in data_thread ()
#11 0x0000003616e077f1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000037f0ae5ccd in clone () from /lib64/libc.so.6


[4] http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05063.html

_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
