We're running several instances of Tomcat 8.5.20 / JDK 8u144 / CentOS 7 in our
company for various web sites on many hosts. Recently I've been trying to
understand a performance problem we're having on our e-commerce web site.
The configuration is the following:
HAProxy <—> 2x Tomcat 8.5.20 <—> JBoss 5.1 EJB <—> Postgres 9.6
Tomcat runs a web site built with Struts / Freemarker that calls JBoss EJBs.
Monitoring a specific task (adding a product to the cart) I see the following:
- with a freshly started Tomcat instance, the task takes around 0.8 seconds.
Most of the time is spent in the two RMI calls the task makes.
- with an instance that has been running for some time, the task can take 2-3
seconds, occasionally 5-6 seconds. Most of the time is still spent in the RMI
calls, i.e. it's the RMI calls that slow down.
- restarting the JVM fixes the issue
- ***it seems*** (I'm still testing this, since there appears to be no command
available to trigger a Metaspace GC directly) that when Metaspace is garbage
collected, Tomcat then performs like a fresh instance.
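To test the Metaspace hypothesis without a restart, a full GC (which under G1 also collects Metaspace) can be forced from the command line with jcmd and the response times compared immediately before and after; `<pid>` below stands for the Tomcat JVM's process id:

```shell
# Force a full GC (same effect as System.gc()); under G1 this also
# collects Metaspace, so a slow instance should speed up if the
# hypothesis is right.
jcmd <pid> GC.run
```

Note this has no effect if the JVM runs with -XX:+DisableExplicitGC, which does not appear in the option list below.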
Since we're running more than one Tomcat instance (2 in production for this
website, 1 for development) I can see that the issue is isolated to a single
Tomcat instance or the JVM/host where it runs, because the other Tomcat
instances behave well at the same time that one is slow. The same JBoss/Postgres
backend is also used by Java batch jobs and fat clients, and it works well
there with consistent times.
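For comparing instances objectively, the timings above were taken with a simple wall-clock wrapper around the suspect calls; a minimal sketch (the `Supplier` passed in stands in for the real RMI/EJB call, which is not shown here):

```java
import java.util.function.Supplier;

public class CallTimer {

    // Runs a single call, prints the elapsed wall-clock time, and
    // returns the call's result unchanged.
    static <T> T timed(String label, Supplier<T> call) {
        long start = System.nanoTime();
        T result = call.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + " took " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the real remote call.
        String reply = timed("addToCart", () -> "ok");
        System.out.println(reply);
    }
}
```

Wrapping each of the two RMI calls separately would also show whether both slow down equally or only one of them.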
To clarify: at the moment one production Tomcat that has been running for some
time finishes the task in 3 seconds, the development Tomcat or a freshly
started production Tomcat instance does the same task in less than one second.
Note that repeating the task always gives consistent results, i.e. the instance
that has been running for some time is always slow and the freshly started
instance is always fast.
Tomcat is running with these VM options:
-Xms20480m -Xmx20480m -Dsun.rmi.dgc.client.gcInterval=3600000
-XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:ReservedCodeCacheSize=1g
-XX:InitialCodeCacheSize=256m -XX:+UseHugeTLBFS -XX:MetaspaceSize=1g
Some of the options have been added recently (for example the increased code
cache size), but they seem to have had no impact on the issue.
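To confirm whether the slow-to-fast transitions actually line up with Metaspace collections, full GC logging could be enabled alongside the existing -XX:+PrintGCTimeStamps (standard JDK 8 flags; the log path is just an example):

```
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/gc.log
```

The G1 full GC entries in that log report Metaspace usage before and after each collection, which can then be correlated with the response-time measurements.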
Metaspace grows to about 1.6 GB before being collected; after a collection it
is around 200 MB. Heap usage is variable: it usually stays under 10 GB and is
around 1 GB after a garbage collection.
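Those Metaspace numbers can also be watched live with jstat (JDK 8; the MU/MC columns are Metaspace used/committed in KB); `<pid>` again stands for the Tomcat process id:

```shell
# Sample GC and Metaspace statistics every 5 seconds.
jstat -gc <pid> 5s
```

Running this on a slow instance while forcing a collection should show MU dropping from ~1.6 GB toward ~200 MB at the same moment the response times recover, if the hypothesis holds.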
CPU usage rarely goes over 10%. Loaded classes: between 20k and 40k. Active
sessions: around 100-120 per instance.
Any help or direction to understand what’s causing this is greatly appreciated.
Ing. Andrea Vettori