I have not tried the G1 collector yet, but according to Oracle it is production ready.
You can use jstat to monitor GC on a tserver and confirm whether GC really is the cause of the pauses. My usual GC-related options for tservers are:

  -XX:NewSize=2G
  -XX:MaxNewSize=2G
  -XX:MaxPermSize=512m
  -XX:CMSInitiatingOccupancyFraction=50
  -XX:+UseParNewGC
  -XX:SurvivorRatio=6
  -XX:ParallelGCThreads=16
  -XX:ConcGCThreads=8
  -XX:+UseCondCardMark
  -XX:+UnlockDiagnosticVMOptions
  -XX:ParGCCardsPerStrideChunk=4096
  -XX:+UseConcMarkSweepGC
  -XX:+CMSClassUnloadingEnabled

If you are doing a lot of ingest via batch writes (which the UpSess messages imply), you might consider increasing tserver.walog.max.size from 1G to 2G (but doing so will cause the loss of more data if a tserver dies).

The troubleshooting documentation that ships with Accumulo
<https://github.com/apache/accumulo/blob/master/docs/src/main/asciidoc/chapters/troubleshooting.txt>
is also helpful for tracking down latency issues.

--
Jeff Kubina
410-988-4436

On Thu, Oct 13, 2016 at 10:49 AM, Noe Detore <ndet...@minerkasch.com> wrote:

> Yes, seeing a lot of DEBUG:Upsess. Also seeing
> [server.GarbageCollectionLogger]
> DEBUG: gc ParNew=64.69(+1.24) secs ConcurrentMarkSweep=102.51(+0.06) secs
> freemem=4,844,821,808(-20,292,780,896) totalmem=25,525,551,104
> 2016-10-13 11:22:17,963 [zookeeper.ZooLock] DEBUG: event null None
> Disconnected
>
> During hotspot seems like a java gc pause is causing zk heart beat to miss
> and then expire. Are there recommend java gc configurations? We are using
> native memory. Would trying G1 gc be advised?
>
> Thank you
>
> On Fri, Oct 7, 2016 at 8:23 PM, Jeff Kubina <jeff.kub...@gmail.com> wrote:
>
>> Noe,
>>
>> Do you have a lot (1000s) of "[tserver.TableServer] DEBUG: UpSess ..."
>> messages in your tserver logs prior to the FATAL or "ERROR: Lost tablet
>> server lock" error message?
>>
>> Jeff
>>
>> --
>> Jeff Kubina
>> 410-988-4436
>>
>> On Fri, Oct 7, 2016 at 10:34 AM, Noe Detore <ndet...@minerkasch.com>
>> wrote:
>>
>>> Any updates on this issue
>>> https://issues.apache.org/jira/browse/ACCUMULO-3336 ?
>>> I am seeing this behavior using 1.7.2 on one of
>>> our clusters. Not seeing on other clusters, but what could be some causes?
>>> Swap on server looks good as there is none. Are there particular
>>> configurations to adjust?
>>>
>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>> KeeperErrorCode = Session expired ...
>>> 2016-10-06 23:22:30,633 [zookeeper.DistributedWorkQueue] INFO : Got
>>> unexpected zookeeper event: None for ...
>>> 2016-10-06 23:22:30,679 [tserver.TabletServer] ERROR: Lost tablet server
>>> lock (reason = SESSION_EXPIRED), exiting
>>>
>>> Thanks
>>> Noe
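
As a complement to watching a tserver with jstat, the GarbageCollectionLogger lines quoted in this thread can be scanned for GC time spikes. The following is only a rough sketch: it assumes each entry reads name=total(+delta) secs, where the value in parentheses is the GC time added since the previous log line, and the function names and the 1.0 second threshold are hypothetical, chosen for illustration. Note that a large delta is accumulated GC time, not necessarily one long pause.

```python
import re

# Assumed entry format, e.g. "ParNew=64.69(+1.24) secs":
# collector name, cumulative seconds, increase since the previous log line.
GC_ENTRY = re.compile(r"(\w+)=([\d.]+)\(\+([\d.]+)\) secs")


def gc_deltas(line):
    """Return {collector: seconds added since the previous log line}."""
    return {name: float(delta) for name, _total, delta in GC_ENTRY.findall(line)}


def long_pauses(line, threshold_secs):
    """Return collectors whose increase meets or exceeds threshold_secs."""
    return sorted(n for n, d in gc_deltas(line).items() if d >= threshold_secs)


# Sample line taken from the log excerpt quoted in this thread.
line = ("DEBUG: gc ParNew=64.69(+1.24) secs "
        "ConcurrentMarkSweep=102.51(+0.06) secs")

print(gc_deltas(line))
print(long_pauses(line, 1.0))  # hypothetical threshold, in seconds
```

Running this over the tserver debug log and comparing the flagged timestamps with the ZooLock "Disconnected" events should show whether the session expirations line up with heavy GC activity.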