[ 
https://issues.apache.org/jira/browse/ACCUMULO-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved ACCUMULO-3336.
----------------------------------
       Resolution: Cannot Reproduce
    Fix Version/s:     (was: 1.8.1)

This issue is getting far too much traction from users who think that this is a 
bug.

The issue I reported here is not a simple SESSION_EXPIRED from ZooKeeper. 
Unless you really understand what's going on, you need to tune the 
configuration of Accumulo, HDFS or the operating system. This is not a "bug".
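
To make the tuning point concrete, here is a rough sketch of the arithmetic 
behind the log quoted below. The ZooSession lines show a timeout of 30000 
(assumed here to be milliseconds), and the GC pause checker line reports 43.1 
seconds since its last check. Assuming the process was stalled for roughly 
that long, the stall outlasts the session timeout, so ZooKeeper expires the 
session and the ephemeral lock node goes with it. The class, enum and method 
names are hypothetical and are not part of Accumulo or ZooKeeper.

```java
// Hypothetical helper for illustration only; not Accumulo or ZooKeeper code.
final class SessionPauseCheck {

    // Local stand-in for the connection states that appear in the log
    // ("Disconnected", "Expired"). This is not ZooKeeper's own enum.
    enum State { CONNECTED, DISCONNECTED, EXPIRED }

    // A client that is stalled for longer than its session timeout cannot
    // heartbeat in time, so the server is free to expire the session.
    static boolean pauseExceedsTimeout(long pauseMillis, long sessionTimeoutMillis) {
        return pauseMillis > sessionTimeoutMillis;
    }

    // A disconnect alone keeps the session (and its ephemeral nodes) alive;
    // only expiration removes the ephemeral lock node.
    static boolean lockCanSurvive(State state) {
        return state != State.EXPIRED;
    }

    public static void main(String[] args) {
        long pauseMillis = 43_100L;          // "was 43.1 seconds since last check"
        long sessionTimeoutMillis = 30_000L; // "with timeout 30000" (assumed ms)

        System.out.println("pause exceeds timeout: "
                + pauseExceedsTimeout(pauseMillis, sessionTimeoutMillis));
        System.out.println("lock survives Disconnected: "
                + lockCanSurvive(State.DISCONNECTED));
        System.out.println("lock survives Expired: "
                + lockCanSurvive(State.EXPIRED));
    }
}
```

If the first check is true for your own logs, the thing to look at is why the 
process stalled (GC, swapping, slow disks) or whether the timeout is too 
tight for the environment, not the lock code.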

> ZK session reconnect still results in loss of ZK lock
> -----------------------------------------------------
>
>                 Key: ACCUMULO-3336
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3336
>             Project: Accumulo
>          Issue Type: Bug
>          Components: zookeeper
>    Affects Versions: 1.5.2, 1.6.1
>            Reporter: Josh Elser
>
> Saw the following
> {noformat}
> 2014-11-14 08:38:30,612 [zookeeper.DistributedWorkQueue] INFO : Got 
> unexpected zookeeper event: None for 
> /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:30,621 [zookeeper.ZooCache] WARN : Saw (possibly) transient 
> exception communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for 
> /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/config/tserver.compaction.warn.time
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>       at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
>       at 
> org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:96)
>       at 
> org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:65)
>       at 
> org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:90)
>       at 
> org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:136)
>       at 
> org.apache.accumulo.tserver.CompactionWatcher.run(CompactionWatcher.java:84)
>       at 
> org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
>       at java.util.TimerThread.mainLoop(Timer.java:555)
>       at java.util.TimerThread.run(Timer.java:505)
> 2014-11-14 08:38:30,672 [zookeeper.DistributedWorkQueue] INFO : Got 
> unexpected zookeeper event: None for 
> /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:30,672 [zookeeper.ZooLock] DEBUG: event null None 
> Disconnected
> 2014-11-14 08:38:31,484 [zookeeper.ZooReader] WARN : Saw (possibly) transient 
> exception communicating with ZooKeeper
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for 
> /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tservers/ip-172-31-13-177:37709
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>       at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:109)
>       at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:381)
>       at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2014-11-14 08:38:31,484 [zookeeper.ZooCache] WARN : Saw (possibly) transient 
> exception communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for 
> /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tables/!0/namespace
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>       at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
>       at 
> org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:304)
>       at 
> org.apache.accumulo.server.conf.TableParentConfiguration.getNamespaceId(TableParentConfiguration.java:47)
>       at 
> org.apache.accumulo.server.conf.NamespaceConfiguration.getPath(NamespaceConfiguration.java:85)
>       at 
> org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:98)
>       at 
> org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:107)
>       at 
> org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:103)
>       at 
> org.apache.accumulo.core.conf.AccumuloConfiguration.getCount(AccumuloConfiguration.java:193)
>       at 
> org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2636)
>       at 
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>       at java.lang.Thread.run(Thread.java:745)
> 2014-11-14 08:38:31,484 [zookeeper.Retry] DEBUG: Sleeping for 250ms before 
> retrying operation
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Removing closed 
> ZooKeeper session to localhost:12644
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Connecting to 
> localhost:12644 with timeout 30000 with auth
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Removing closed 
> ZooKeeper session to localhost:12644
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Connecting to 
> localhost:12644 with timeout 30000 with auth
> 2014-11-14 08:38:31,692 [tserver.TabletServer] DEBUG: gc ParNew=0.10(+0.04) 
> secs ConcurrentMarkSweep=0.05(+0.00) secs freemem=118,013,904(+6,412,200) 
> totalmem=129,761,280
> 2014-11-14 08:38:31,692 [tserver.TabletServer] WARN : GC pause checker not 
> called in a timely fashion. Expected every 5.0 seconds but was 43.1 seconds 
> since last check
> 2014-11-14 08:38:31,700 [zookeeper.DistributedWorkQueue] DEBUG: Looking for 
> work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,701 [zookeeper.DistributedWorkQueue] DEBUG: Looking for 
> work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,715 [tserver.TabletServer] DEBUG: ScanSess tid 
> 172.31.13.177:35935 !0 1 entries in 0.03 secs, nbTimes = [24 24 24.00 1] 
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Scanning trace hosts in 
> zookeeper: /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tracers
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Trace hosts: []
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got 
> unexpected zookeeper event: None for 
> /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/replication/workqueue
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got 
> unexpected zookeeper event: None for 
> /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got 
> unexpected zookeeper event: None for 
> /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,739 [zookeeper.ZooSession] DEBUG: Session expired, state 
> of current session : Expired
> 2014-11-14 08:38:31,739 [zookeeper.ZooLock] DEBUG: event null None Expired
> 2014-11-14 08:38:31,741 [tserver.TabletServer] FATAL: Lost tablet server lock 
> (reason = SESSION_EXPIRED), exiting.
> {noformat}
> ZooKeeper code appears to have disconnected, closed the disconnected 
> connection and then opened a new session. However, the ZooLock, IIRC, didn't 
> reconnect and hung the tserver.
> If we want to support this, it might require rehashing some of the ZooLock 
> code (to prevent the tserver from processing while the tserver doesn't have 
> its lock).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
