We have slowly been trying to squash stability issues with gmetad. Many of the symptoms seem to relate to sockets ending up in CLOSE_WAIT, although I'm unsure whether that is a useful clue or a side effect. The user-visible problem is metrics not getting updated and TN climbing for an entire cluster (or several clusters). We had some success with patches and greater UDP buffer tuning in [1], but the problem has not totally gone away, and we have tried to track occasional updates at [2].
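For anyone who wants to check for the same CLOSE_WAIT symptom on their own gmetad host, here is a minimal sketch that lists IPv4 sockets in that state. It is not part of ganglia; the function names are made up for illustration, and it assumes a little-endian Linux host with the usual /proc/net/tcp layout (state code 08 is CLOSE_WAIT).

```python
# Sketch only: find IPv4 sockets in CLOSE_WAIT from /proc/net/tcp-style text.
# Helper names are hypothetical; assumes a little-endian Linux host.

CLOSE_WAIT = "08"  # kernel TCP state code for CLOSE_WAIT


def parse_addr(hex_addr):
    """Turn '0100007F:21C9' into ('127.0.0.1', 8649)."""
    ip_hex, port_hex = hex_addr.split(":")
    # The IPv4 address is printed as little-endian hex, so read bytes in reverse.
    octets = [str(int(ip_hex[i:i + 2], 16)) for i in range(6, -2, -2)]
    return ".".join(octets), int(port_hex, 16)


def close_wait_sockets(proc_net_tcp_text):
    """Return a list of (local, remote) address pairs stuck in CLOSE_WAIT."""
    stuck = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        if fields[3] == CLOSE_WAIT:
            stuck.append((parse_addr(fields[1]), parse_addr(fields[2])))
    return stuck
```

Feeding it the contents of /proc/net/tcp and filtering the remote port for the gmond port (8649 by default) should show whether a poll connection is stuck. It does not say which process owns the socket; lsof is still needed for that.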

We run a pair of gmetads (3.6.0) polling the same gmonds, and used that to do some additional experiments focusing on the user-perceived problem. In short, only one of them experienced elevated TN (i.e. load_one > 300); the other did not, meaning the problem is likely with gmetad and not gmond. We also captured lsof/pstack output and noticed that:

* A connection to the relevant cluster was stuck in CLOSE_WAIT
* There were several threads that appeared to be waiting on a pthreads construct [3]

While googling we found a very similar issue from 2009 that specifically suggested that hash_insert might be the problem (we had been focusing on libexpat internally). Note that unlike [2] and [4], this problem lasts 10-15 minutes instead of indefinitely. Due to the similarity of symptoms and code, I suspect they are related.

Is anyone still experiencing similar problems? If so, any ideas?

[1] http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg07715.html
[2] https://github.com/ganglia/monitor-core/issues/47
[3] Thread 7 (Thread 0x7f1e2d7fb700 (LWP 3329)):
#0  0x0000003616e0b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000031eb20c38b in pthread_rdwr_wlock_np () from /usr/lib64/libganglia-3.6.0.so.0
#2  0x00000031eb20b7be in hash_insert () from /usr/lib64/libganglia-3.6.0.so.0
#3  0x0000000000405c78 in end ()
#4  0x00000037f520a538 in ?? () from /lib64/libexpat.so.1
#5  0x00000037f520b8ce in ?? () from /lib64/libexpat.so.1
#6  0x00000037f520d1fa in ?? () from /lib64/libexpat.so.1
#7  0x00000037f520db2b in ?? () from /lib64/libexpat.so.1
#8  0x00000037f5203f82 in XML_ParseBuffer () from /lib64/libexpat.so.1
#9  0x0000000000405b1b in process_xml ()
#10 0x0000000000404603 in data_thread ()
#11 0x0000003616e077f1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000037f0ae5ccd in clone () from /lib64/libc.so.6
[4] http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05063.html

_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general