Hi Paul,

Many thanks for your contribution. I'll certainly look into Zabbix, although I must confess to being aghast at what appears to be a large and complex tool for what I'd hoped was quite simple. I hadn't realised these servers were so temperamental.

Before I lose myself in getting acquainted with a new sophisticated product, could you tell me whether Zabbix (or something else) will help me identify the following?

- When are users suffering timeouts (doesn't have to be real time; happy to check a summary later)
- Where was the timeout occurring (network, Apache, Tomcat, Postgres)
- What was the cause of the timeout (too many connections, low memory, long Java operation, long query, etc.)
- What specific item (Java program, DB query) was responsible

I wonder whether all this should be discoverable in the logs, with the right configuration.
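On the logs question, here is a minimal sketch of how the "when" item might be answered from Apache alone. It assumes the access log has been configured with the `%D` format code (request duration in microseconds) as the last field, for example `LogFormat "%h %l %u %t \"%r\" %>s %b %D"`. The function name `slow_requests`, the regular expression, and the 30-second threshold are illustrative assumptions, not anything taken from an existing setup. It only shows which requests were slow at the Apache layer; it says nothing about whether Tomcat or Postgres was the cause.

```python
import re

# Assumed log layout: ... [timestamp] "request line" status bytes micros
# The trailing number is Apache's %D (duration in microseconds).
LINE_RE = re.compile(
    r'\[(?P<when>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ (?P<micros>\d+)$'
)


def slow_requests(lines, threshold_s=30.0):
    """Return (timestamp, request, status, seconds) for each slow request.

    Lines that do not match the assumed layout are skipped silently.
    """
    slow = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue
        seconds = int(match.group("micros")) / 1_000_000
        if seconds >= threshold_s:
            slow.append(
                (
                    match.group("when"),
                    match.group("request"),
                    int(match.group("status")),
                    seconds,
                )
            )
    return slow
```

Feeding it the lines of an access log (for instance via `open(path)`) gives a summary that can be checked after the fact rather than in real time.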
I've seen a lot of mention of JMX for Tomcat monitoring, but I've shied away from it since I wanted to start simple. Perhaps there is no simple ... ;-(

________________________________________
From: Paul Libbrecht [p...@hoplahup.net]
Sent: 01 November 2014 09:41
To: XWiki Users
Subject: Re: [xwiki-users] Monitoring an Xwiki stack

Hello all,

Here's my experience at monitoring XWikis.

With i2geo.net and with my private XWiki, I use a Zabbix server. This PHP-based monitoring tool is quite easy to configure for HTTP monitoring, and with a few more steps you get a mail notification when, e.g., a timeout occurs in connections. I used HypericHQ for a while, a Java-based monitoring tool, which was rather nice to work with, but a machine-name change broke everything, so I looked for something a bit more modern.

At curriki.org, a site with lots of visitors, quite a few tools are used for monitoring.

- First, for a reliable and honest view of the system from outside, alertsite.com is used. It is very effective at detecting breakages, including potential breakages of internet backbones. We use monitoring from three locations.
- Second, because the XWiki servers do indeed sometimes need a push, there used to be a regular script that checked a basic page and, if the check failed, auto-restarted the app-server. For us, this is a bit unsafe because we like to check things after a restart.
- Third, for a while, we ran a "combined monitoring" which combined a small graphical view synced with the logs of Apache, the app-server, thread dumps, and MySQL. This allowed us to catch "bad actions", which sometimes happen when power users perform actions that trigger queries so big that they lock out others (group deletions were such an action).
- Finally, we also added a Zabbix instance which collects HTTP monitoring as well as other "classical" values (disks, memory, apache-stats, …).
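The page-check script from the second point could look roughly like the following. This is a minimal sketch, not the script that ran at curriki.org: the function names `page_is_healthy` and `needs_attention`, the timeout, and the three-failure threshold are all assumptions made for illustration. Because an automatic restart was described as unsafe, the sketch only decides when a human (or a notification) should step in and leaves the restart itself out.

```python
import http.client
import urllib.request


def page_is_healthy(url, timeout_s=10.0):
    """Return True if `url` answers with an HTTP 2xx status in time.

    Timeouts, refused connections, and HTTP error statuses all count
    as failures (urllib raises them as OSError subclasses).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        return False


def needs_attention(history, max_failures=3):
    """Return True if the most recent `max_failures` checks all failed.

    `history` is a list of booleans, oldest first, one per check.
    Requiring several failures in a row avoids reacting to one blip.
    """
    recent = history[-max_failures:]
    return len(recent) == max_failures and not any(recent)
```

A cron job or a small loop would call `page_is_healthy` at a fixed interval, append the result to a list, and send a mail when `needs_attention` returns True.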

The rhythm at curriki is about a week: after a week, one of the cluster nodes (there are two currently) needs a restart because memory gets exhausted and the GC starts to fail. We generally get alertsite errors then.

The benefit of running a monitoring infrastructure such as Zabbix is that you can analyze the behavior of multiple variables and see whether there is a way to predict when things are going wrong. It remains a matter of gut feeling, but it still gives you quite some confidence.

It would be really nice if we could converge on a set of JMX analysis "items" for Zabbix, so that we could analyze the XWiki-relevant information more concretely (in particular the cache behavior) and start adjusting things so that we run out of memory less often.

paul
_______________________________________________
users mailing list
users@xwiki.org
http://lists.xwiki.org/mailman/listinfo/users