Smalyshev added a comment.

So I took a look at our logs from Sep 29 with http://gceasy.io/ and came to the following conclusions (they could be totally misguided, so please tell me if so):

  • The log file covers about 5-6 hours. I think this should be enough, but if there are daily patterns it could be useful to have a larger data set to look at as a whole. We could probably get that by bumping the rotated file size from 20M to ~100M (see the log-rotation sketch after this list).
  • GC consumes 7-8% of CPU time. I have no idea whether that's normal for a Java service or not.
  • There are four long-pause GC events in the logs for the last day; one took 40s. From what I see, three of them came in quick succession (around 3:08 am), which points to the same failure scenario I alluded to before: instead of failing the allocation, the JVM starts obsessively GCing, trying to eke out that last byte. This is the failure mode I'm most concerned about. I'm not sure what the alternative is here; as I understand it, Java has no safe out-of-memory failure mode? It's either slow death by GC or fast death by OOM?
  • Cross-referencing these events with the WDQS dashboards shows that they blocked updates, causing a lag of several minutes. The server was essentially in a coma for those minutes but then recovered. The good news is that it did recover; the bad news is that this should never happen in the first place.
  • Heap usage after GC is reported at just under 6G. I think that means making the heap smaller than 6G is asking for trouble. There are also several peaks reaching 12G.
  • GC times mostly stay under 5s, but many are over 1s; it would probably be a good idea to keep them down.
  • Old gen size stays mostly between 5-6G, which seems to indicate that's what the long-term Blazegraph data structures need (unless I misunderstand what old gen means). There are several spikes, which I suspect are due to long-running queries that take so long that normally transient structures get promoted to old gen, but I may be wrong here. This conclusion is also supported by the fact that the peaks are generally short-lived (which I interpret as the query finally finishing or timing out and its memory being released).
  • Young gen size goes from 6-7G before GC to under 1G after GC. That means there's probably some room to reduce the heap size if we want more frequent but less costly cleanups (see the sizing sketch after this list).
  • Not sure why our main GC cause is "GCLocker Initiated GC". Maybe because Blazegraph uses the native process heap for its internal memory manager instead of the Java object heap, so each operation there is a JNI call? As I understand it, a GC requested while a thread is inside a JNI critical region gets deferred and is logged as "GCLocker Initiated GC" once the region is released, which would fit that theory.
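
On the first bullet: if the GC log rotation is done by the JVM itself, bumping the size could look like the lines below. This is a minimal sketch assuming JVM-side rotation; the log path and file count are hypothetical, and if rotation is actually handled externally (e.g. by logrotate), the size limit should be bumped there instead.

    -Xloggc:/var/log/wdqs/wdqs-blazegraph_jvm_gc.log   # hypothetical path
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=10
    -XX:GCLogFileSize=100M                             # up from the current 20M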
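
And some back-of-the-envelope sizing from the heap-usage and young-gen bullets above. This is a sketch under stated assumptions, not a tested setting: I haven't checked our actual -Xmx and am guessing it sits somewhere around 12-16G, so every value here needs to be verified against the logs first.

    # Observed: ~6G old-gen floor after GC, spikes up to ~12G,
    # young gen growing to 6-7G before each collection.
    -Xms12g -Xmx12g   # keep room for the observed ~12G spikes
    -Xmn2g            # cap young gen at ~2G: more frequent but much
                      # cheaper collections than letting it grow to 6-7G
    # Caveat: under G1, fixing -Xmn overrides the adaptive young-gen sizing.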

Some ideas:

  • Can we play with -XX:MaxTenuringThreshold and see whether it gets better results? Old gen size shows some volatility, which IMHO may mean some objects that are not supposed to be "old" end up in the "old" space. From my understanding of how tenuring is supposed to work, old gen should be more stable size-wise.
  • Should we adjust -Xmn and/or -XX:NewRatio to give more space to old gen without growing the young gen space too much?
  • Should we set -XX:MaxGCPauseMillis? I know it's not a hard limit, but maybe hinting to the VM that a 40s GC is not OK would help?
  • Should we adjust -XX:InitiatingHeapOccupancyPercent so that the concurrent marking cycle starts sooner, potentially resulting in shorter collections? (A sketch combining these flags follows this list.)
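
To make the above concrete, here's one possible experiment combining these knobs, assuming we run G1 (the IHOP flag only exists there); every value is a guess to be A/B-tested against the dashboards, not a recommendation:

    -XX:MaxTenuringThreshold=15            # 15 is the max; defaults differ per
                                           # collector, so raise/lower and watch
                                           # how promotion volume reacts
    -XX:MaxGCPauseMillis=500               # soft pause budget only, but at least
                                           # hints that 40s is not OK
    -XX:InitiatingHeapOccupancyPercent=35  # start the concurrent cycle earlier
                                           # (default is 45)
    # Note: combining a fixed -Xmn or -XX:NewRatio with MaxGCPauseMillis tends
    # to work against G1's adaptive sizing, so I'd try the pause goal on its own first.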
