[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2019-05-02 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831837#comment-16831837
 ] 

Josh Elser commented on ACCUMULO-3898:
--

Haven't tested it.

> `admin stop` created deadlock in TabletServer being stopped
> ---
>
> Key: ACCUMULO-3898
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
> Project: Accumulo
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.0
>Reporter: Josh Elser
>Priority: Critical
> Attachments: 0001-ACCUMULO-3898-Test-to-reproduce.patch
>
>
> Take a tabletserver, tserver1, one of a few.
> The other tabletservers have already been requested to stop and all have 
> stopped (via {{admin stop}}).
> {noformat}
> 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
> tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
> 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for tserver1:9997[24ddf314d8c0010]
> {noformat}
> The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
> it's the last server, it's hosting {{r<<}}. It unloads it
> {noformat}
> 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) +r<<
> 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) +r<<
> 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
> master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
> 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x04df6fa0
> 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x04df7040
> 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x022b4da0
> 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r<< closed
> 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
> +r<<@(null,tserver1:9997[24ddf314d8c0010],null)
> 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
> 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
> /root_tablet/future_location
> 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
> tablet location
> 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r<<
> {noformat}
> After this, however, we can see it tries to run a minor compaction on a user 
> tablet (1<<)
> {noformat}
> 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) !0;~<
> 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) !0;~<
> 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) 1<<
> 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
> 0x02e82e00
> 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
> for table : +r row : !0;
> 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
> hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
>  1<<
> 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1<< 590 read | 
> 590 written | 12,553 entries/sec |  0.047 secs
> 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1<< 
> tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
> 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1<< 
> tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
> 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
> BEFORE compactionThreshold = 0.550 maxObserved = 150,256
> 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
> compactionThreshold = 0.605
> {noformat}
> So, as we might expect, this is never going to finish. The close won't happen 
> because root isn't assigned, and the master isn't going to reassign root 
> unless it gets a new tserver (which it won't because we're trying to stop the 
> entire instance).
> To verify this is what's gumming up the works:
> {noformat}
> "tablet migration 1" daemon prio=10 tid=0x02e84800 nid=0x671 waiting 
> on condition [0x7fb80a03a000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.accumulo.core.util.UtilWaitThread.sleep(UtilWaitThread.java:27)
>   at 
> org.apache.accumulo.core.client.impl.RootTabletLocator.locateTablet(RootTabletLocator.java:129)
>   at 
> 

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2019-04-17 Thread Michael Miller (JIRA)


[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820400#comment-16820400
 ] 

Michael Miller commented on ACCUMULO-3898:
--

[~elserj] are you still seeing this in 1.8 or 1.9?

> `admin stop` created deadlock in TabletServer being stopped
> ---
>
> Key: ACCUMULO-3898
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
> Project: Accumulo
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.0
>Reporter: Josh Elser
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: 0001-ACCUMULO-3898-Test-to-reproduce.patch
>
>
> Take a tabletserver, tserver1, one of a few.
> The other tabletservers have already been requested to stop and all have 
> stopped (via {{admin stop}}).
> {noformat}
> 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
> tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
> 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for tserver1:9997[24ddf314d8c0010]
> {noformat}
> The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
> it's the last server, it's hosting {{r<<}}. It unloads it
> {noformat}
> 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) +r<<
> 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) +r<<
> 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
> master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
> 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x04df6fa0
> 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x04df7040
> 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x022b4da0
> 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r<< closed
> 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
> +r<<@(null,tserver1:9997[24ddf314d8c0010],null)
> 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
> 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
> /root_tablet/future_location
> 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
> tablet location
> 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r<<
> {noformat}
> After this, however, we can see it tries to run a minor compaction on a user 
> tablet (1<<)
> {noformat}
> 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) !0;~<
> 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) !0;~<
> 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) 1<<
> 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
> 0x02e82e00
> 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
> for table : +r row : !0;
> 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
> hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
>  1<<
> 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1<< 590 read | 
> 590 written | 12,553 entries/sec |  0.047 secs
> 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1<< 
> tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
> 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1<< 
> tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
> 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
> BEFORE compactionThreshold = 0.550 maxObserved = 150,256
> 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
> compactionThreshold = 0.605
> {noformat}
> So, as we might expect, this is never going to finish. The close won't happen 
> because root isn't assigned, and the master isn't going to reassign root 
> unless it gets a new tserver (which it won't because we're trying to stop the 
> entire instance).
> To verify this is what's gumming up the works:
> {noformat}
> "tablet migration 1" daemon prio=10 tid=0x02e84800 nid=0x671 waiting 
> on condition [0x7fb80a03a000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.accumulo.core.util.UtilWaitThread.sleep(UtilWaitThread.java:27)
>   at 
> org.apache.accumulo.core.client.impl.RootTabletLocator.locateTablet(RootTabletLocator.java:129)
>   at 
> 

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2017-07-07 Thread John Vines (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078454#comment-16078454
 ] 

John Vines commented on ACCUMULO-3898:
--

I'm actually seeing this with some regular occurance on 1.7.4 in small 3-4 node 
setups. I issue a stop for all of my tservers in my cluster, and I frequently 
found myself left with 1 stuck hosting just the metadata tablets (but no root 
tablet).

> `admin stop` created deadlock in TabletServer being stopped
> ---
>
> Key: ACCUMULO-3898
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
> Project: Accumulo
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.0
>Reporter: Josh Elser
>Priority: Critical
> Fix For: 1.8.2, 2.0.0
>
> Attachments: 0001-ACCUMULO-3898-Test-to-reproduce.patch
>
>
> Take a tabletserver, tserver1, one of a few.
> The other tabletservers have already been requested to stop and all have 
> stopped (via {{admin stop}}).
> {noformat}
> 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
> tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
> 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for tserver1:9997[24ddf314d8c0010]
> {noformat}
> The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
> it's the last server, it's hosting {{r<<}}. It unloads it
> {noformat}
> 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) +r<<
> 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) +r<<
> 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
> master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
> 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x04df6fa0
> 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x04df7040
> 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x022b4da0
> 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r<< closed
> 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
> +r<<@(null,tserver1:9997[24ddf314d8c0010],null)
> 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
> 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
> /root_tablet/future_location
> 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
> tablet location
> 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r<<
> {noformat}
> After this, however, we can see it tries to run a minor compaction on a user 
> tablet (1<<)
> {noformat}
> 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) !0;~<
> 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) !0;~<
> 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) 1<<
> 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
> 0x02e82e00
> 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
> for table : +r row : !0;
> 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
> hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
>  1<<
> 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1<< 590 read | 
> 590 written | 12,553 entries/sec |  0.047 secs
> 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1<< 
> tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
> 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1<< 
> tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
> 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
> BEFORE compactionThreshold = 0.550 maxObserved = 150,256
> 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
> compactionThreshold = 0.605
> {noformat}
> So, as we might expect, this is never going to finish. The close won't happen 
> because root isn't assigned, and the master isn't going to reassign root 
> unless it gets a new tserver (which it won't because we're trying to stop the 
> entire instance).
> To verify this is what's gumming up the works:
> {noformat}
> "tablet migration 1" daemon prio=10 tid=0x02e84800 nid=0x671 waiting 
> on condition [0x7fb80a03a000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> 

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2015-10-31 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984155#comment-14984155
 ] 

Josh Elser commented on ACCUMULO-3898:
--

Dropped 1.7.1 from the fixVersion. This is tricky to get right and my previous 
approach was a bit of a hack IIRC. I think this requires some real thought 
behind it to do this correctly. The workaround is to just not use {{admin 
stop}} ({{kill $pid}} instead), which isn't ideal, but should be acceptable..

> `admin stop` created deadlock in TabletServer being stopped
> ---
>
> Key: ACCUMULO-3898
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
> Project: Accumulo
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.0
>Reporter: Josh Elser
>Priority: Critical
> Fix For: 1.8.0
>
> Attachments: 0001-ACCUMULO-3898-Test-to-reproduce.patch
>
>
> Take a tabletserver, tserver1, one of a few.
> The other tabletservers have already been requested to stop and all have 
> stopped (via {{admin stop}}).
> {noformat}
> 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
> tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
> 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for tserver1:9997[24ddf314d8c0010]
> {noformat}
> The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
> it's the last server, it's hosting {{r<<}}. It unloads it
> {noformat}
> 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) +r<<
> 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) +r<<
> 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
> master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
> 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x04df6fa0
> 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x04df7040
> 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x022b4da0
> 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r<< closed
> 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
> +r<<@(null,tserver1:9997[24ddf314d8c0010],null)
> 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
> 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
> /root_tablet/future_location
> 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
> tablet location
> 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r<<
> {noformat}
> After this, however, we can see it tries to run a minor compaction on a user 
> tablet (1<<)
> {noformat}
> 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) !0;~<
> 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) !0;~<
> 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) 1<<
> 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
> 0x02e82e00
> 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
> for table : +r row : !0;
> 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
> hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
>  1<<
> 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1<< 590 read | 
> 590 written | 12,553 entries/sec |  0.047 secs
> 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1<< 
> tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
> 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1<< 
> tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
> 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
> BEFORE compactionThreshold = 0.550 maxObserved = 150,256
> 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
> compactionThreshold = 0.605
> {noformat}
> So, as we might expect, this is never going to finish. The close won't happen 
> because root isn't assigned, and the master isn't going to reassign root 
> unless it gets a new tserver (which it won't because we're trying to stop the 
> entire instance).
> To verify this is what's gumming up the works:
> {noformat}
> "tablet migration 1" daemon prio=10 tid=0x02e84800 nid=0x671 waiting 
> on condition [0x7fb80a03a000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at 

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2015-06-11 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582124#comment-14582124
 ] 

Josh Elser commented on ACCUMULO-3898:
--

Was trying to think of alternative ways to approach this because I have a hunch 
that trying, in the Master, to definitively know when we're down to 1 TServer 
is a wash:

* Time/Retry bound the UnloadTabletHandler. If we're trying to unload a tablet, 
eventually we could give up and just let log recovery happen. Between deadlock 
and re-calculating work: I'd take the latter.
* Bound the execution of ShutdownTServer -- if it takes too long. Just call 
halt() on it.

I had also been playing with the following, but even in the naive IT attached, 
it was subject to race conditions. It might help, but, if added, it should be 
marked as a best-effort and not a definitively correct action:

{code}
diff --git 
a/server/master/src/main/java/org/apache/accumulo/master/TabletGroupWatcher.java
 
b/server/master/src/main/java/org/apache/accumulo/master/TabletGroupWatcher.java
index 4c47953..bcaf23f 100644
--- 
a/server/master/src/main/java/org/apache/accumulo/master/TabletGroupWatcher.java
+++ 
b/server/master/src/main/java/org/apache/accumulo/master/TabletGroupWatcher.java
@@ -223,6 +223,11 @@ class TabletGroupWatcher extends Daemon {
 goal = TabletGoalState.HOSTED;
   }
 }
+  } else if (state == TabletState.HOSTED  dependentWatcher != null 
 this.master.onlineTabletServers().size() == 1) {
+Master.log.debug(Only one TabletServer exists, ensuring tablets 
not unloaded before dependent tablets);
+if (dependentWatcher.assignedOrHosted()  0) {
+  goal = TabletGoalState.HOSTED;
+}
   }

   if (goal == TabletGoalState.HOSTED) {
{code}

 `admin stop` created deadlock in TabletServer being stopped
 ---

 Key: ACCUMULO-3898
 URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
 Project: Accumulo
  Issue Type: Bug
  Components: tserver
Affects Versions: 1.7.0
Reporter: Josh Elser
Priority: Critical
 Fix For: 1.7.1, 1.8.0

 Attachments: 0001-ACCUMULO-3898-Test-to-reproduce.patch


 Take a tabletserver, tserver1, one of a few.
 The other tabletservers have already been requested to stop and all have 
 stopped (via {{admin stop}}).
 {noformat}
 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
 tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
 shutdown requested for tserver1:9997[24ddf314d8c0010]
 {noformat}
 The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
 it's the last server, it's hosting {{r}}. It unloads it
 {noformat}
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) +r
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) +r
 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
 master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df6fa0
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df7040
 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x022b4da0
 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r closed
 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
 +r@(null,tserver1:9997[24ddf314d8c0010],null)
 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
 /root_tablet/future_location
 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
 tablet location
 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r
 {noformat}
 After this, however, we can see it tries to run a minor compaction on a user 
 tablet (1)
 {noformat}
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) !0;~
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) !0;~
 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) 1
 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
 0x02e82e00
 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
 for table : +r row : !0;
 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
 

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2015-06-10 Thread Eric Newton (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581329#comment-14581329
 ] 

Eric Newton commented on ACCUMULO-3898:
---

bq. Seems to be a misnomer.

It is. Lots of bits of code dealing with tablet unloading are named migration 
which isn't really correct, because tservers only care about assignments, and 
unloads. The code dealing with migrations in the master are generally related 
to balancing, and are true migrations.


 `admin stop` created deadlock in TabletServer being stopped
 ---

 Key: ACCUMULO-3898
 URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
 Project: Accumulo
  Issue Type: Bug
  Components: tserver
Affects Versions: 1.7.0
Reporter: Josh Elser
Priority: Critical
 Fix For: 1.7.1, 1.8.0


 Take a tabletserver, tserver1, one of a few.
 The other tabletservers have already been requested to stop and all have 
 stopped (via {{admin stop}}).
 {noformat}
 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
 tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
 shutdown requested for tserver1:9997[24ddf314d8c0010]
 {noformat}
 The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
 it's the last server, it's hosting {{r}}. It unloads it
 {noformat}
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) +r
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) +r
 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
 master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df6fa0
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df7040
 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x022b4da0
 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r closed
 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
 +r@(null,tserver1:9997[24ddf314d8c0010],null)
 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
 /root_tablet/future_location
 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
 tablet location
 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r
 {noformat}
 After this, however, we can see it tries to run a minor compaction on a user 
 tablet (1)
 {noformat}
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) !0;~
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) !0;~
 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) 1
 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
 0x02e82e00
 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
 for table : +r row : !0;
 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
 hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
  1
 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1 590 read | 
 590 written | 12,553 entries/sec |  0.047 secs
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
 BEFORE compactionThreshold = 0.550 maxObserved = 150,256
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
 compactionThreshold = 0.605
 {noformat}
 So, as we might expect, this is never going to finish. The close won't happen 
 because root isn't assigned, and the master isn't going to reassign root 
 unless it gets a new tserver (which it won't because we're trying to stop the 
 entire instance).
 To verify this is what's gumming up the works:
 {noformat}
 tablet migration 1 daemon prio=10 tid=0x02e84800 nid=0x671 waiting 
 on condition [0x7fb80a03a000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at 
 org.apache.accumulo.core.util.UtilWaitThread.sleep(UtilWaitThread.java:27)
   at 
 

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2015-06-10 Thread Eric Newton (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581338#comment-14581338
 ] 

Eric Newton commented on ACCUMULO-3898:
---

What is the desired behavior when stopping all tservers (and not the master)?  
Should it begin assigning tablets when a new tablet server comes up?  Maybe we 
need a new Master goal state: NORMAL, SAFE_MODE, MASTER_OF_THE_DEAD and 
CLEAN_STOP. That way we can transition through the stopping, and yet not stop 
the master.


 `admin stop` created deadlock in TabletServer being stopped
 ---

 Key: ACCUMULO-3898
 URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
 Project: Accumulo
  Issue Type: Bug
  Components: tserver
Affects Versions: 1.7.0
Reporter: Josh Elser
Priority: Critical
 Fix For: 1.7.1, 1.8.0


 Take a tabletserver, tserver1, one of a few.
 The other tabletservers have already been requested to stop and all have 
 stopped (via {{admin stop}}).
 {noformat}
 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
 tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
 shutdown requested for tserver1:9997[24ddf314d8c0010]
 {noformat}
 The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
 it's the last server, it's hosting {{r}}. It unloads it
 {noformat}
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) +r
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) +r
 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
 master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df6fa0
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df7040
 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x022b4da0
 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r closed
 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
 +r@(null,tserver1:9997[24ddf314d8c0010],null)
 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
 /root_tablet/future_location
 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
 tablet location
 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r
 {noformat}
 After this, however, we can see it tries to run a minor compaction on a user 
 tablet (1)
 {noformat}
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) !0;~
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) !0;~
 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) 1
 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
 0x02e82e00
 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
 for table : +r row : !0;
 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
 hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
  1
 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1 590 read | 
 590 written | 12,553 entries/sec |  0.047 secs
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
 BEFORE compactionThreshold = 0.550 maxObserved = 150,256
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
 compactionThreshold = 0.605
 {noformat}
 So, as we might expect, this is never going to finish. The close won't happen 
 because root isn't assigned, and the master isn't going to reassign root 
 unless it gets a new tserver (which it won't because we're trying to stop the 
 entire instance).
 To verify this is what's gumming up the works:
 {noformat}
 tablet migration 1 daemon prio=10 tid=0x02e84800 nid=0x671 waiting 
 on condition [0x7fb80a03a000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at 
 org.apache.accumulo.core.util.UtilWaitThread.sleep(UtilWaitThread.java:27)
   at 
 

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2015-06-10 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581283#comment-14581283
 ] 

Josh Elser commented on ACCUMULO-3898:
--

All I know about {{1}} is (coalesced across all logs)

{noformat}
2015-06-10 20:41:55,228 [master.Master] DEBUG: Normal Tablets assigning 
tablets: [1=ip-172-31-39-246.ec2.internal:9997[24ddf314d8c0010]]
2015-06-10 20:41:55,237 [tserver.TabletServer] INFO : Loading tablet 1
2015-06-10 20:41:55,240 [tserver.TabletServer] INFO : 
ip-172-31-39-246.ec2.internal:9997: got assignment from master: 1
2015-06-10 20:41:55,256 [tserver.TabletServer] DEBUG: Loading extent: 1
2015-06-10 20:41:55,256 [tserver.TabletServer] DEBUG: verifying extent 1
2015-06-10 20:41:55,368 [tablet.Tablet] DEBUG: got [] for logs for 1
2015-06-10 20:41:55,413 [tablet.Tablet] DEBUG: Reloading constraints for 
extent: 1
2015-06-10 20:41:55,442 [tablet.Tablet] TABLET_HIST: 1 opened
2015-06-10 20:41:55,529 [master.EventCoordinator] INFO : tablet 1 was loaded 
on ip-172-31-39-246.ec2.internal:9997
2015-06-10 20:41:55,575 [tablet.Tablet] DEBUG: Reloading constraints for 
extent: 1
2015-06-10 20:41:55,607 [tablet.Tablet] DEBUG: Reloading constraints for 
extent: 1
2015-06-10 20:41:55,620 [tablet.Tablet] DEBUG: Reloading constraints for 
extent: 1
2015-06-10 20:41:55,634 [tablet.Tablet] DEBUG: Reloading constraints for 
extent: 1
2015-06-10 20:41:55,651 [tablet.Tablet] DEBUG: Reloading constraints for 
extent: 1
2015-06-10 20:41:55,665 [tablet.Tablet] DEBUG: Reloading constraints for 
extent: 1
2015-06-10 20:46:10,983 [tserver.TabletServer] INFO : Adding 1 logs for extent 
1 as alias 2
2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
queueMinC=false disableWrites=false) 1
2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
hdfs://ip-172-31-39-245.ec2.internal:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
 1
2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1 590 read | 590 
written | 12,553 entries/sec |  0.047 secs
2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1 
ip-172-31-39-246.ec2.internal:9997/hdfs://ip-172-31-39-245.ec2.internal:8020/apps/accumulo/data/wal/ip-172-31-39-246.ec2.internal+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1 
ip-172-31-39-246.ec2.internal:9997/hdfs://ip-172-31-39-245.ec2.internal:8020/apps/accumulo/data/wal/ip-172-31-39-246.ec2.internal+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
{noformat}

 `admin stop` created deadlock in TabletServer being stopped
 ---

 Key: ACCUMULO-3898
 URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
 Project: Accumulo
  Issue Type: Bug
  Components: tserver
Affects Versions: 1.7.0
Reporter: Josh Elser
Priority: Critical
 Fix For: 1.7.1, 1.8.0


 Take a tabletserver, tserver1, one of a few.
 The other tabletservers have already been requested to stop and all have 
 stopped (via {{admin stop}}).
 {noformat}
 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
 tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
 shutdown requested for tserver1:9997[24ddf314d8c0010]
 {noformat}
 The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
 it's the last server, it's hosting {{r}}. It unloads it
 {noformat}
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) +r
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) +r
 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
 master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df6fa0
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df7040
 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x022b4da0
 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r closed
 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
 +r@(null,tserver1:9997[24ddf314d8c0010],null)
 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
 /root_tablet/future_location
 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
 tablet location
 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r
 {noformat}
 After this, however, we can see it tries to run a minor compaction on a user 
 tablet (1)
 {noformat}
 2015-06-10 21:07:40,621 

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2015-06-10 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581277#comment-14581277
 ] 

Josh Elser commented on ACCUMULO-3898:
--

Best guess at this point is that the master shouldn't have requested an unload 
of root yet. The UnloadTabletHandler should have been only invoked by master 
rpc (best as I can tell). The usual candidates (jvm pauses, network partitions) 
should really be relevant (in my head) since the master should have all of the 
user tablets offline before taking off meta and then root.

Sadly, the logging in the master is tremendously sparse around this area in the 
code. Might need some [~kturner] [~ecn] love.

 `admin stop` created deadlock in TabletServer being stopped
 ---

 Key: ACCUMULO-3898
 URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
 Project: Accumulo
  Issue Type: Bug
  Components: tserver
Affects Versions: 1.7.0
Reporter: Josh Elser
Priority: Critical
 Fix For: 1.7.1, 1.8.0


 Take a tabletserver, tserver1, one of a few.
 The other tabletservers have already been requested to stop and all have 
 stopped (via {{admin stop}}).
 {noformat}
 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
 tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
 shutdown requested for tserver1:9997[24ddf314d8c0010]
 {noformat}
 The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
 it's the last server, it's hosting {{r}}. It unloads it
 {noformat}
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) +r
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) +r
 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
 master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df6fa0
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df7040
 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x022b4da0
 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r closed
 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
 +r@(null,tserver1:9997[24ddf314d8c0010],null)
 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
 /root_tablet/future_location
 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
 tablet location
 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r
 {noformat}
 After this, however, we can see it tries to run a minor compaction on a user 
 tablet (1)
 {noformat}
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) !0;~
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) !0;~
 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) 1
 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
 0x02e82e00
 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
 for table : +r row : !0;
 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
 hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
  1
 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1 590 read | 
 590 written | 12,553 entries/sec |  0.047 secs
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
 BEFORE compactionThreshold = 0.550 maxObserved = 150,256
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
 compactionThreshold = 0.605
 {noformat}
 So, as we might expect, this is never going to finish. The close won't happen 
 because root isn't assigned, and the master isn't going to reassign root 
 unless it gets a new tserver (which it won't because we're trying to stop the 
 entire instance).
 To verify this is what's gumming up the works:
 {noformat}
 tablet migration 1 daemon prio=10 tid=0x02e84800 nid=0x671 waiting 
 on condition [0x7fb80a03a000]
java.lang.Thread.State: 

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2015-06-10 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581308#comment-14581308
 ] 

Josh Elser commented on ACCUMULO-3898:
--

Alright, the offending bit (as best I can tell) is in TabletGroupWatcher.java

{code}
  // if we are shutting down all the tabletservers, we have to do it in 
order
  if (goal == TabletGoalState.UNASSIGNED  state == 
TabletState.HOSTED) {
if (this.master.serversToShutdown.equals(currentTServers.keySet())) 
{
  if (dependentWatcher != null  
dependentWatcher.assignedOrHosted()  0) {
goal = TabletGoalState.HOSTED;
  }
}
  }
{code}

We _only_ consult the dependentWatcher (meta for users, root for meta, none for 
root), when the Master's goal state is going down. So, that means we're 
guaranteed to deadlock on `admin stop` for the last TabletServer if the 
goalState isn't set to something that is stopping (e.g. SAFE_MODE, 
UNLOAD_METADATA_TABLETS, UNLOAD_ROOT_TABLET, STOP).

 `admin stop` created deadlock in TabletServer being stopped
 ---

 Key: ACCUMULO-3898
 URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
 Project: Accumulo
  Issue Type: Bug
  Components: tserver
Affects Versions: 1.7.0
Reporter: Josh Elser
Priority: Critical
 Fix For: 1.7.1, 1.8.0


 Take a tabletserver, tserver1, one of a few.
 The other tabletservers have already been requested to stop and all have 
 stopped (via {{admin stop}}).
 {noformat}
 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
 tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
 shutdown requested for tserver1:9997[24ddf314d8c0010]
 {noformat}
 The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
 it's the last server, it's hosting {{r}}. It unloads it
 {noformat}
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) +r
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) +r
 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
 master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df6fa0
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df7040
 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x022b4da0
 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r closed
 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
 +r@(null,tserver1:9997[24ddf314d8c0010],null)
 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
 /root_tablet/future_location
 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
 tablet location
 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r
 {noformat}
 After this, however, we can see it tries to run a minor compaction on a user 
 tablet (1)
 {noformat}
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) !0;~
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) !0;~
 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) 1
 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
 0x02e82e00
 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
 for table : +r row : !0;
 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
 hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
  1
 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1 590 read | 
 590 written | 12,553 entries/sec |  0.047 secs
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
 BEFORE compactionThreshold = 0.550 maxObserved = 150,256
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
 compactionThreshold = 0.605
 {noformat}
 So, as we might expect, this is never going to finish. The close won't happen 
 because root 

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2015-06-10 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581342#comment-14581342
 ] 

Josh Elser commented on ACCUMULO-3898:
--

IMO, it shouldn't be necessary to set a GoalState at all to bounce the 
tabletservers (say I want to reload an SKVI on the TabletServers). Gracefully 
stopping the servers shouldn't have the possibility to deadlock.

Ideally, the dependentWatcher check should also include a conditional to ensure 
that, on the last TabletServer, we unload the tablets in the correct order. My 
hunch is that the difficulty lies in knowing when we're processing the last 
TabletServer.

Tryinh to write an IT to repro this is unsuccessful thus far. 

 `admin stop` created deadlock in TabletServer being stopped
 ---

 Key: ACCUMULO-3898
 URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
 Project: Accumulo
  Issue Type: Bug
  Components: tserver
Affects Versions: 1.7.0
Reporter: Josh Elser
Priority: Critical
 Fix For: 1.7.1, 1.8.0


 Take a tabletserver, tserver1, one of a few.
 The other tabletservers have already been requested to stop and all have 
 stopped (via {{admin stop}}).
 {noformat}
 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
 tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
 shutdown requested for tserver1:9997[24ddf314d8c0010]
 {noformat}
 The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
 it's the last server, it's hosting {{r}}. It unloads it
 {noformat}
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) +r
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) +r
 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
 master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df6fa0
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df7040
 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x022b4da0
 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r closed
 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
 +r@(null,tserver1:9997[24ddf314d8c0010],null)
 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
 /root_tablet/future_location
 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
 tablet location
 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r
 {noformat}
 After this, however, we can see it tries to run a minor compaction on a user 
 tablet (1)
 {noformat}
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) !0;~
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) !0;~
 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) 1
 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
 0x02e82e00
 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
 for table : +r row : !0;
 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
 hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
  1
 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1 590 read | 
 590 written | 12,553 entries/sec |  0.047 secs
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
 BEFORE compactionThreshold = 0.550 maxObserved = 150,256
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
 compactionThreshold = 0.605
 {noformat}
 So, as we might expect, this is never going to finish. The close won't happen 
 because root isn't assigned, and the master isn't going to reassign root 
 unless it gets a new tserver (which it won't because we're trying to stop the 
 entire instance).
 To verify this is what's gumming up the works:
 {noformat}
 tablet migration 1 daemon prio=10 tid=0x02e84800 nid=0x671 waiting 
 on condition [0x7fb80a03a000]

[jira] [Commented] (ACCUMULO-3898) `admin stop` created deadlock in TabletServer being stopped

2015-06-10 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581344#comment-14581344
 ] 

Josh Elser commented on ACCUMULO-3898:
--

bq. Tryinh to write an IT to repro this is unsuccessful thus far. 

Oo!! I did repro it. Although it took a few runs to actually have it fail.

 `admin stop` created deadlock in TabletServer being stopped
 ---

 Key: ACCUMULO-3898
 URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
 Project: Accumulo
  Issue Type: Bug
  Components: tserver
Affects Versions: 1.7.0
Reporter: Josh Elser
Priority: Critical
 Fix For: 1.7.1, 1.8.0


 Take a tabletserver, tserver1, one of a few.
 The other tabletservers have already been requested to stop and all have 
 stopped (via {{admin stop}}).
 {noformat}
 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
 tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
 shutdown requested for tserver1:9997[24ddf314d8c0010]
 {noformat}
 The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
 it's the last server, it's hosting {{r}}. It unloads it
 {noformat}
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) +r
 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) +r
 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
 master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df6fa0
 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x04df7040
 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
 0x022b4da0
 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r closed
 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
 +r@(null,tserver1:9997[24ddf314d8c0010],null)
 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
 /root_tablet/future_location
 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
 tablet location
 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r
 {noformat}
 After this, however, we can see it tries to run a minor compaction on a user 
 tablet (1)
 {noformat}
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) !0;~
 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
 completeClose=true) !0;~
 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
 queueMinC=false disableWrites=false) 1
 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
 0x02e82e00
 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
 for table : +r row : !0;
 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
 hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F007.rf_tmp
  1
 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1 590 read | 
 590 written | 12,553 entries/sec |  0.047 secs
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1 
 tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
 BEFORE compactionThreshold = 0.550 maxObserved = 150,256
 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
 compactionThreshold = 0.605
 {noformat}
 So, as we might expect, this is never going to finish. The close won't happen 
 because root isn't assigned, and the master isn't going to reassign root 
 unless it gets a new tserver (which it won't because we're trying to stop the 
 entire instance).
 To verify this is what's gumming up the works:
 {noformat}
 tablet migration 1 daemon prio=10 tid=0x02e84800 nid=0x671 waiting 
 on condition [0x7fb80a03a000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at 
 org.apache.accumulo.core.util.UtilWaitThread.sleep(UtilWaitThread.java:27)
   at 
 org.apache.accumulo.core.client.impl.RootTabletLocator.locateTablet(RootTabletLocator.java:129)
   at