Was there actually an 11 second delay in the tserver's debug log (2:00:51 to 2:01:02) or did you omit some log statements?

The log messages in your original email also showed MultiScanSession(s) immediately before the ZK lock was lost.

Can you give us any information about the type of query workload you're servicing here? A MultiScanSession is the equivalent of a "piece" of a BatchScanner running against a tserver. Are you doing any sort of heavy work in a SortedKeyValueIterator running on these tservers?
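For context, the gap I'm asking about comes straight from the two timestamps in your excerpt; a quick sketch of that arithmetic (the class and method names here are just for illustration, not anything from Accumulo):

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

public class LogGap {
    // log4j's default layout uses a comma before the milliseconds
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    static long gapMs(String earlier, String later) {
        return Duration.between(LocalTime.parse(earlier, FMT),
                                LocalTime.parse(later, FMT)).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps copied from your tserver log excerpt
        System.out.println(gapMs("02:00:51,580", "02:01:02,267") + " ms");
        // prints "10687 ms" -- just under 11 s of silence before the FATAL
    }
}
```

A pause that long is in the range where a JVM garbage-collection stall could plausibly blow a ZooKeeper session timeout, which is why the question matters.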

On 11/12/13, 9:36 AM, buttercream wrote:
I increased all of the servers to 32GB of memory and confirmed that I have
the flags you mentioned in the env file. Unfortunately, within a day I
lost one of the tservers. In the tserver log, looking at the timestamps
leading up to the event, I see:
02:00:03,835 [cache.LruBlockCache]
02:00:51,580 [tabletserver.TabletServer] DEBUG: MultiScanSess
02:01:02,267 [tabletserver.TabletServer] FATAL: Lost tablet server lock
(reason = LOCK_DELETED), exiting.

What's interesting on this one is that in the master log file, there is no
error message at that time. What I do see is this:
02:01:02,168 [master.Master] DEBUG: Finished gathering information from 2
servers in 0.01 seconds

That would mean the tserver killed itself within milliseconds of the master
getting the information successfully. Any thoughts on this one?



--
View this message in context: 
http://apache-accumulo.1065345.n5.nabble.com/Tserver-kills-themselves-from-lost-Zookeeper-locks-tp6125p6360.html
Sent from the Users mailing list archive at Nabble.com.
