Hi Paul,

Many thanks for your contribution. I'll certainly look into Zabbix, although I 
must confess to being aghast at what appears to be a large and complex tool for 
what I'd hoped was quite simple. I hadn't realised these servers were so 
temperamental. Before I lose myself in getting acquainted with a new 
sophisticated product, could you tell me whether Zabbix (or something else) 
will help me identify the following?:
- When are users suffering timeouts (doesn't have to be real time, happy to 
check summary later)
- Where was the timeout occurring (network, Apache, Tomcat, Postgres)
- What was the cause of the timeout (too many connections, low memory, long 
Java operation, long query, etc)
- What specific item (Java program, DB query) was responsible
I wonder whether all this should be discoverable in the logs, with the right 
configuration.
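
To make the log idea concrete, here is a rough sketch of what I have in mind. 
It assumes Apache is configured to append the request duration to each access 
log line with %D (time taken to serve the request, in microseconds); the log 
format, the 10-second threshold and the function names are my own assumptions, 
not anything XWiki or Zabbix prescribes. It only answers the "when" and "which 
URL" questions, not the cause.

```python
import re
from collections import Counter

# Assumed Apache format (hypothetical, adjust to your own LogFormat):
#   LogFormat "%h %l %u %t \"%r\" %>s %b %D" timed
# %D is the time taken to serve the request, in microseconds.
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ (?P<micros>\d+)$'
)


def slow_requests(lines, threshold_seconds=10.0):
    """Yield (timestamp, request, seconds) for requests at or above the threshold."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        seconds = int(match.group("micros")) / 1_000_000
        if seconds >= threshold_seconds:
            yield match.group("time"), match.group("request"), seconds


def summarise(lines, threshold_seconds=10.0):
    """Count slow requests per requested path."""
    counts = Counter()
    for _time, request, _seconds in slow_requests(lines, threshold_seconds):
        parts = request.split()
        path = parts[1] if len(parts) > 1 else request
        counts[path] += 1
    return counts
```

Run over a day's access log (for example `summarise(open(path))`, with `path` 
pointing at whatever file Apache writes to), this would give a per-URL count of 
slow requests to check later.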

I've seen a lot of mention of JMX for Tomcat monitoring, but I've shied away 
from it since I wanted to start simple, but perhaps there is no simple ... ;-(

________________________________________
From: Paul Libbrecht [p...@hoplahup.net]
Sent: 01 November 2014 09:41
To: XWiki Users
Subject: Re: [xwiki-users] Monitoring an Xwiki stack

Hello all,

Here's my experience at monitoring XWikis.

With i2geo.net and with my private XWiki, I use a zabbix server.
This PHP-based monitoring tool is quite easy to configure for HTTP monitoring, 
and with a few more steps you get a mail notification when, e.g., a timeout 
occurs in connections.
I used HypericHQ for a while, a Java-based monitoring tool, which was rather 
nice to work with, but a machine-name change broke everything, so I looked for 
something a bit more modern.

At curriki.org, a site with lots of visitors, quite a few tools are used for 
monitoring.
- First, for the safety and honesty of an outside system, alertsite.com is 
used. It is very effective at detecting breakages, including potential 
breakages of internet backbones. We use monitoring from three locations.
- Second, because, indeed, the XWiki servers sometimes need a push, there used 
to be a regular script that checked a basic page and, if the check failed, 
auto-restarted the app-server. For us, this is a bit unsafe because we like to 
control things after a restart.
- Third, for a while, we have been running a "combined monitoring" which 
brings together a small graphical view synced with the logs of Apache, the 
app-server, thread dumps, and MySQL. This allowed us to catch "bad actions", 
which sometimes happen when power users trigger queries so big that they lock 
out others (group deletions were one such action).
- Finally, we also added a Zabbix server which collects HTTP monitoring as well 
as other "classical" values (disks, memory, apache-stats, …).
The rhythm at curriki is about a week… after a week, one of the two cluster 
nodes (there are two currently) needs a restart because some memory gets 
exhausted and the GC starts to fail. We generally get alertsite errors then.
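
The "regular script" from the second point can be sketched roughly as follows. 
This is only an illustration, not the script that ran at curriki: the function 
name, the default timeout and the result strings are assumptions. It 
deliberately reports a result instead of restarting anything, so that a human 
stays in control of the restart.

```python
import socket
import urllib.error
import urllib.request


def check_page(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch one page and classify the outcome as a short string.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            response.getcode()  # urlopen raises HTTPError for 4xx/5xx by default
    except urllib.error.HTTPError as error:
        return f"http {error.code}"
    except urllib.error.URLError as error:
        # Timeouts while connecting arrive wrapped in a URLError.
        if isinstance(error.reason, socket.timeout):
            return "timeout"
        return "unreachable"
    except socket.timeout:
        # Timeouts while reading the response are raised directly.
        return "timeout"
    return "ok"
```

Called from cron every few minutes against a basic wiki page, anything other 
than "ok" could be mailed to the administrator.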

The interest of running a monitoring infrastructure such as Zabbix is that you 
can analyze the behavior of multiple variables and see if there is a way to 
predict when things are going wrong. It remains a matter of gut feeling but 
still gives you quite some confidence.

It would be really nice if we could converge on a set of JMX analysis "items" 
for Zabbix so that we could analyze the XWiki-relevant information more 
concretely (in particular the cache behavior) and start tuning things so that 
we run out of memory less often.

paul
_______________________________________________
users mailing list
users@xwiki.org
http://lists.xwiki.org/mailman/listinfo/users
