[jira] [Updated] (ACCUMULO-4410) Master did not resume balancing after administrative tserver shutdown

2018-11-30 Thread Ivan Bella (JIRA)


 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4410:
-
   Resolution: Fixed
Fix Version/s: 1.9.3
   Status: Resolved  (was: Patch Available)

> Master did not resume balancing after administrative tserver shutdown
> 
>
> Key: ACCUMULO-4410
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4410
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.0, 1.9.0, 2.0.0
>Reporter: Josh Elser
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.9.3, 2.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I realized that I misconfigured a property, so I started manually stopping 
> each tabletserver (using {{accumulo admin stop }}).
> This worked as intended: the tablets were migrated and the tserver was 
> stopped:
> {noformat}
> 2016-08-17 15:24:20,871 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]
> 2016-08-17 15:24:20,991 [master.Master] DEBUG: FATE op shutting down 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c] finished
> {noformat}
> However, after this point, the master did not resume balancing:
> {noformat}
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.02 seconds
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:36,831 [replication.WorkDriver] DEBUG: Sleeping 3 ms 
> before next work assignment
> 2016-08-17 15:24:41,074 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:41,083 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:51,134 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:51,135 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}
> Even after I brought a new tserver online on that host, the master still did 
> not resume balancing:
> {noformat}
> 2016-08-17 15:25:53,015 [master.Master] INFO : New servers: 
> [jelser-accumulo-180-4.openstacklocal:54722[2568579a5c3006e]]
> 2016-08-17 15:25:53,026 [master.EventCoordinator] INFO : There are now 5 
> tablet servers
> 2016-08-17 15:25:53,096 [master.Master] DEBUG: Finished gathering information 
> from 5 servers in 0.06 seconds
> 2016-08-17 15:25:53,109 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ACCUMULO-4410) Master did not resume balancing after administrative tserver shutdown

2018-11-30 Thread Ivan Bella (JIRA)


[ 
https://issues.apache.org/jira/browse/ACCUMULO-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705279#comment-16705279
 ] 

Ivan Bella commented on ACCUMULO-4410:
--

Merged into 1.9 and 2.0 (master)

> Master did not resume balancing after administrative tserver shutdown
> 
>
> Key: ACCUMULO-4410
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4410
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.0, 1.9.0, 2.0.0
>Reporter: Josh Elser
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I realized that I misconfigured a property, so I started manually stopping 
> each tabletserver (using {{accumulo admin stop }}).
> This worked as intended: the tablets were migrated and the tserver was 
> stopped:
> {noformat}
> 2016-08-17 15:24:20,871 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]
> 2016-08-17 15:24:20,991 [master.Master] DEBUG: FATE op shutting down 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c] finished
> {noformat}
> However, after this point, the master did not resume balancing:
> {noformat}
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.02 seconds
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:36,831 [replication.WorkDriver] DEBUG: Sleeping 3 ms 
> before next work assignment
> 2016-08-17 15:24:41,074 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:41,083 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:51,134 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:51,135 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}
> Even after I brought a new tserver online on that host, the master still did 
> not resume balancing:
> {noformat}
> 2016-08-17 15:25:53,015 [master.Master] INFO : New servers: 
> [jelser-accumulo-180-4.openstacklocal:54722[2568579a5c3006e]]
> 2016-08-17 15:25:53,026 [master.EventCoordinator] INFO : There are now 5 
> tablet servers
> 2016-08-17 15:25:53,096 [master.Master] DEBUG: Finished gathering information 
> from 5 servers in 0.06 seconds
> 2016-08-17 15:25:53,109 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ACCUMULO-4410) Master did not resume balancing after administrative tserver shutdown

2018-11-27 Thread Ivan Bella (JIRA)


 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4410:
-
Fix Version/s: 1.9.0
Affects Version/s: 2.0.0
   1.9.0
   Status: Patch Available  (was: In Progress)

https://github.com/apache/accumulo/pull/781

> Master did not resume balancing after administrative tserver shutdown
> 
>
> Key: ACCUMULO-4410
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4410
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.9.0, 2.0.0, 1.8.0
>Reporter: Josh Elser
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0, 1.9.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I realized that I misconfigured a property, so I started manually stopping 
> each tabletserver (using {{accumulo admin stop }}).
> This worked as intended: the tablets were migrated and the tserver was 
> stopped:
> {noformat}
> 2016-08-17 15:24:20,871 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]
> 2016-08-17 15:24:20,991 [master.Master] DEBUG: FATE op shutting down 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c] finished
> {noformat}
> However, after this point, the master did not resume balancing:
> {noformat}
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.02 seconds
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:36,831 [replication.WorkDriver] DEBUG: Sleeping 3 ms 
> before next work assignment
> 2016-08-17 15:24:41,074 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:41,083 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:51,134 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:51,135 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}
> Even after I brought a new tserver online on that host, the master still did 
> not resume balancing:
> {noformat}
> 2016-08-17 15:25:53,015 [master.Master] INFO : New servers: 
> [jelser-accumulo-180-4.openstacklocal:54722[2568579a5c3006e]]
> 2016-08-17 15:25:53,026 [master.EventCoordinator] INFO : There are now 5 
> tablet servers
> 2016-08-17 15:25:53,096 [master.Master] DEBUG: Finished gathering information 
> from 5 servers in 0.06 seconds
> 2016-08-17 15:25:53,109 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (ACCUMULO-4410) Master did not resume balancing after administrative tserver shutdown

2018-11-27 Thread Ivan Bella (JIRA)


 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4410:
-
Comment: was deleted

(was: See https://github.com/apache/accumulo/pull/781)

> Master did not resume balancing after administrative tserver shutdown
> 
>
> Key: ACCUMULO-4410
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4410
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.0, 1.9.0, 2.0.0
>Reporter: Josh Elser
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.9.0, 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I realized that I misconfigured a property, so I started manually stopping 
> each tabletserver (using {{accumulo admin stop }}).
> This worked as intended: the tablets were migrated and the tserver was 
> stopped:
> {noformat}
> 2016-08-17 15:24:20,871 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]
> 2016-08-17 15:24:20,991 [master.Master] DEBUG: FATE op shutting down 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c] finished
> {noformat}
> However, after this point, the master did not resume balancing:
> {noformat}
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.02 seconds
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:36,831 [replication.WorkDriver] DEBUG: Sleeping 3 ms 
> before next work assignment
> 2016-08-17 15:24:41,074 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:41,083 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:51,134 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:51,135 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}
> Even after I brought a new tserver online on that host, the master still did 
> not resume balancing:
> {noformat}
> 2016-08-17 15:25:53,015 [master.Master] INFO : New servers: 
> [jelser-accumulo-180-4.openstacklocal:54722[2568579a5c3006e]]
> 2016-08-17 15:25:53,026 [master.EventCoordinator] INFO : There are now 5 
> tablet servers
> 2016-08-17 15:25:53,096 [master.Master] DEBUG: Finished gathering information 
> from 5 servers in 0.06 seconds
> 2016-08-17 15:25:53,109 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ACCUMULO-4410) Master did not resume balancing after administrative tserver shutdown

2018-11-27 Thread Ivan Bella (JIRA)


[ 
https://issues.apache.org/jira/browse/ACCUMULO-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700966#comment-16700966
 ] 

Ivan Bella commented on ACCUMULO-4410:
--

See https://github.com/apache/accumulo/pull/781

> Master did not resume balancing after administrative tserver shutdown
> 
>
> Key: ACCUMULO-4410
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4410
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.0
>Reporter: Josh Elser
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I realized that I misconfigured a property, so I started manually stopping 
> each tabletserver (using {{accumulo admin stop }}).
> This worked as intended: the tablets were migrated and the tserver was 
> stopped:
> {noformat}
> 2016-08-17 15:24:20,871 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]
> 2016-08-17 15:24:20,991 [master.Master] DEBUG: FATE op shutting down 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c] finished
> {noformat}
> However, after this point, the master did not resume balancing:
> {noformat}
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.02 seconds
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:36,831 [replication.WorkDriver] DEBUG: Sleeping 3 ms 
> before next work assignment
> 2016-08-17 15:24:41,074 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:41,083 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:51,134 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:51,135 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}
> Even after I brought a new tserver online on that host, the master still did 
> not resume balancing:
> {noformat}
> 2016-08-17 15:25:53,015 [master.Master] INFO : New servers: 
> [jelser-accumulo-180-4.openstacklocal:54722[2568579a5c3006e]]
> 2016-08-17 15:25:53,026 [master.EventCoordinator] INFO : There are now 5 
> tablet servers
> 2016-08-17 15:25:53,096 [master.Master] DEBUG: Finished gathering information 
> from 5 servers in 0.06 seconds
> 2016-08-17 15:25:53,109 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ACCUMULO-4410) Master did not resume balancing after administrative tserver shutdown

2018-11-26 Thread Ivan Bella (JIRA)


[ 
https://issues.apache.org/jira/browse/ACCUMULO-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699004#comment-16699004
 ] 

Ivan Bella commented on ACCUMULO-4410:
--

The problem centers on the serversToShutdown set maintained in the master.  
When we do an admin stop, the server is added to this set.  Once the tserver 
scan cycle determines that the server is down, it is removed from the set.  
However, an EventCoordinator call will subsequently put the tserver right 
back into the set, after which it is never removed.  I am working through the 
logic to determine a clean fix for this.
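
To make that lifecycle concrete, here is a minimal Java sketch of how a 
shutdown-tracking set can leak an entry; the class and method names are 
hypothetical stand-ins, not the actual Master internals:

{code:java}
// Hypothetical illustration of the race described above. ShutdownTracker
// and its methods are invented names, not the actual Master internals.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ShutdownTracker {
  // Balancing is skipped while this set is non-empty.
  private final Set<String> serversToShutdown = ConcurrentHashMap.newKeySet();

  void onAdminStop(String tserver) {
    // admin stop requested: start draining this server
    serversToShutdown.add(tserver);
  }

  void onScanCycle(Set<String> liveServers) {
    // the tserver scan cycle notices the server is gone and removes it...
    serversToShutdown.removeIf(ts -> !liveServers.contains(ts));
  }

  void onShutdownEvent(String tserver) {
    // ...but a late EventCoordinator callback re-adds it; the old session
    // never reports live again, so nothing removes the entry and
    // canBalance() stays false forever
    serversToShutdown.add(tserver);
  }

  boolean canBalance() {
    return serversToShutdown.isEmpty();
  }
}
{code}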

> Master did not resume balancing after administrative tserver shutdown
> 
>
> Key: ACCUMULO-4410
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4410
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.0
>Reporter: Josh Elser
>Assignee: Ivan Bella
>Priority: Critical
> Fix For: 2.0.0
>
>
> I realized that I misconfigured a property, so I started manually stopping 
> each tabletserver (using {{accumulo admin stop }}).
> This worked as intended: the tablets were migrated and the tserver was 
> stopped:
> {noformat}
> 2016-08-17 15:24:20,871 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]
> 2016-08-17 15:24:20,991 [master.Master] DEBUG: FATE op shutting down 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c] finished
> {noformat}
> However, after this point, the master did not resume balancing:
> {noformat}
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.02 seconds
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:36,831 [replication.WorkDriver] DEBUG: Sleeping 3 ms 
> before next work assignment
> 2016-08-17 15:24:41,074 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:41,083 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:51,134 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:51,135 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}
> Even after I brought a new tserver online on that host, the master still did 
> not resume balancing:
> {noformat}
> 2016-08-17 15:25:53,015 [master.Master] INFO : New servers: 
> [jelser-accumulo-180-4.openstacklocal:54722[2568579a5c3006e]]
> 2016-08-17 15:25:53,026 [master.EventCoordinator] INFO : There are now 5 
> tablet servers
> 2016-08-17 15:25:53,096 [master.Master] DEBUG: Finished gathering information 
> from 5 servers in 0.06 seconds
> 2016-08-17 15:25:53,109 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ACCUMULO-4410) Master did not resume balancing after administrative tserver shutdown

2018-11-26 Thread Ivan Bella (JIRA)


[ 
https://issues.apache.org/jira/browse/ACCUMULO-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698996#comment-16698996
 ] 

Ivan Bella commented on ACCUMULO-4410:
--

This is causing issues for us because we routinely stop tservers for 
maintenance.  If we use the stop-here.sh script (which invokes the admin stop), 
then the master will get into this situation.  If we kill the tserver instead, 
then we have lease exceptions that we have to manually handle.  So this has 
become critical for us to fix.

> Master did not resume balancing after administrative tserver shutdown
> 
>
> Key: ACCUMULO-4410
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4410
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.0
>Reporter: Josh Elser
>Assignee: Ivan Bella
>Priority: Critical
> Fix For: 2.0.0
>
>
> I realized that I misconfigured a property, so I started manually stopping 
> each tabletserver (using {{accumulo admin stop }}).
> This worked as intended: the tablets were migrated and the tserver was 
> stopped:
> {noformat}
> 2016-08-17 15:24:20,871 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]
> 2016-08-17 15:24:20,991 [master.Master] DEBUG: FATE op shutting down 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c] finished
> {noformat}
> However, after this point, the master did not resume balancing:
> {noformat}
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.02 seconds
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:36,831 [replication.WorkDriver] DEBUG: Sleeping 3 ms 
> before next work assignment
> 2016-08-17 15:24:41,074 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:41,083 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:51,134 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:51,135 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}
> Even after I brought a new tserver online on that host, the master still did 
> not resume balancing:
> {noformat}
> 2016-08-17 15:25:53,015 [master.Master] INFO : New servers: 
> [jelser-accumulo-180-4.openstacklocal:54722[2568579a5c3006e]]
> 2016-08-17 15:25:53,026 [master.EventCoordinator] INFO : There are now 5 
> tablet servers
> 2016-08-17 15:25:53,096 [master.Master] DEBUG: Finished gathering information 
> from 5 servers in 0.06 seconds
> 2016-08-17 15:25:53,109 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ACCUMULO-4410) Master did not resume balancing after administrative tserver shutdown

2018-11-26 Thread Ivan Bella (JIRA)


 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella reassigned ACCUMULO-4410:


Assignee: Ivan Bella

> Master did not resume balancing after administrative tserver shutdown
> 
>
> Key: ACCUMULO-4410
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4410
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.0
>Reporter: Josh Elser
>Assignee: Ivan Bella
>Priority: Critical
> Fix For: 2.0.0
>
>
> I realized that I misconfigured a property, so I started manually stopping 
> each tabletserver (using {{accumulo admin stop }}).
> This worked as intended: the tablets were migrated and the tserver was 
> stopped:
> {noformat}
> 2016-08-17 15:24:20,871 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]
> 2016-08-17 15:24:20,991 [master.Master] DEBUG: FATE op shutting down 
> jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c] finished
> {noformat}
> However, after this point, the master did not resume balancing:
> {noformat}
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.02 seconds
> 2016-08-17 15:24:31,024 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:36,831 [replication.WorkDriver] DEBUG: Sleeping 3 ms 
> before next work assignment
> 2016-08-17 15:24:41,074 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:41,083 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> 2016-08-17 15:24:51,134 [master.Master] DEBUG: Finished gathering information 
> from 4 servers in 0.05 seconds
> 2016-08-17 15:24:51,135 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}
> Even after I brought a new tserver online on that host, the master still did 
> not resume balancing:
> {noformat}
> 2016-08-17 15:25:53,015 [master.Master] INFO : New servers: 
> [jelser-accumulo-180-4.openstacklocal:54722[2568579a5c3006e]]
> 2016-08-17 15:25:53,026 [master.EventCoordinator] INFO : There are now 5 
> tablet servers
> 2016-08-17 15:25:53,096 [master.Master] DEBUG: Finished gathering information 
> from 5 servers in 0.06 seconds
> 2016-08-17 15:25:53,109 [master.Master] DEBUG: not balancing while shutting 
> down servers [jelser-accumulo-180-4.openstacklocal:59540[1568579a4b4004c]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ACCUMULO-4854) DfsLogger exceptions are too verbose

2018-09-04 Thread Ivan Bella (JIRA)


 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella resolved ACCUMULO-4854.
--
Resolution: Fixed

> DfsLogger exceptions are too verbose
> 
>
> Key: ACCUMULO-4854
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4854
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.9.0, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The DfsLogger is constantly spewing exceptions to the logs when the logs are 
> simply being closed.  This is a normal situation when WAL logs roll over.  
> The exceptions should not show up as warnings in the Accumulo monitor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ACCUMULO-4854) DfsLogger exceptions are too verbose

2018-08-25 Thread Ivan Bella (JIRA)


[ 
https://issues.apache.org/jira/browse/ACCUMULO-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592735#comment-16592735
 ] 

Ivan Bella commented on ACCUMULO-4854:
--

Created pull request [https://github.com/apache/accumulo/pull/616]

 

> DfsLogger exceptions are too verbose
> 
>
> Key: ACCUMULO-4854
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4854
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.9.0, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>Priority: Minor
> Fix For: 2.0.0
>
>
> The DfsLogger is constantly spewing exceptions to the logs when the logs are 
> simply being closed.  This is a normal situation when WAL logs roll over.  
> The exceptions should not show up as warnings in the Accumulo monitor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ACCUMULO-4854) DfsLogger exceptions are too verbose

2018-08-25 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4854:


 Summary: DfsLogger exceptions are too verbose
 Key: ACCUMULO-4854
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4854
 Project: Accumulo
  Issue Type: Improvement
  Components: tserver
Affects Versions: 1.9.0, 2.0.0
Reporter: Ivan Bella
Assignee: Ivan Bella
 Fix For: 2.0.0


The DfsLogger is constantly spewing exceptions to the logs when the logs are 
simply being closed.  This is a normal situation when WAL logs roll over.  The 
exceptions should not show up as warnings in the Accumulo monitor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ACCUMULO-4832) Seeing warnings when write ahead log changes.

2018-02-27 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella resolved ACCUMULO-4832.
--
Resolution: Fixed

Fixes applied to 1.7 and merged into 1.8 and master.

> Seeing warnings when write ahead log changes.
> -
>
> Key: ACCUMULO-4832
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4832
> Project: Accumulo
>  Issue Type: Bug
>Reporter: Keith Turner
>Assignee: Ivan Bella
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.7.4, 1.9.0, 2.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> While running continuous ingest against 1.7.4-rc0, I saw a lot of warnings 
> like the following.
> {noformat}
> 2018-02-26 17:51:58,189 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> 2018-02-26 17:51:58,724 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> 2018-02-26 17:51:58,940 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> 2018-02-26 17:51:59,226 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> 2018-02-26 17:51:59,227 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> 2018-02-26 17:51:59,227 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> {noformat}
>  
> The warnings are generated by [TabletServerLogger.java line 
> 341|https://github.com/apache/accumulo/blob/4e91215f101362ef206e9f213b4d8d12b3f6e0e2/server/tserver/src/main/java/org/apache/accumulo/tserver/log/TabletServerLogger.java#L341]
>  when a write ahead log is closed.  Write ahead logs are closed as part of 
> normal operations as seen on [TabletServerLogger.java line 
> 386|https://github.com/apache/accumulo/blob/4e91215f101362ef206e9f213b4d8d12b3f6e0e2/server/tserver/src/main/java/org/apache/accumulo/tserver/log/TabletServerLogger.java#L386].
> There should not be a warning when this happens.  This is caused by changes 
> made for ACCUMULO-4777.  Before these changes, this event was logged at debug. 
> At this time, these changes have not been released.  It would be nice to fix 
> this before releasing 1.7.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ACCUMULO-4832) Seeing warnings when write ahead log changes.

2018-02-26 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella reassigned ACCUMULO-4832:


Assignee: Ivan Bella

> Seeing warnings when write ahead log changes.
> -
>
> Key: ACCUMULO-4832
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4832
> Project: Accumulo
>  Issue Type: Bug
>Reporter: Keith Turner
>Assignee: Ivan Bella
>Priority: Blocker
> Fix For: 1.7.4, 1.9.0, 2.0.0
>
>
> While running continuous ingest against 1.7.4-rc0, I saw a lot of warnings 
> like the following.
> {noformat}
> 2018-02-26 17:51:58,189 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> 2018-02-26 17:51:58,724 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> 2018-02-26 17:51:58,940 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> 2018-02-26 17:51:59,226 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> 2018-02-26 17:51:59,227 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> 2018-02-26 17:51:59,227 [log.TabletServerLogger] WARN : Logs closed while 
> writing, retrying attempt 1 (suppressing retry messages for 18ms)
> {noformat}
>  
> The warnings are generated by [TabletServerLogger.java line 
> 341|https://github.com/apache/accumulo/blob/4e91215f101362ef206e9f213b4d8d12b3f6e0e2/server/tserver/src/main/java/org/apache/accumulo/tserver/log/TabletServerLogger.java#L341]
>  when a write ahead log is closed.  Write ahead logs are closed as part of 
> normal operations as seen on [TabletServerLogger.java line 
> 386|https://github.com/apache/accumulo/blob/4e91215f101362ef206e9f213b4d8d12b3f6e0e2/server/tserver/src/main/java/org/apache/accumulo/tserver/log/TabletServerLogger.java#L386].
> There should not be a warning when this happens.  This is caused by changes 
> made for ACCUMULO-4777.  Before these changes, this event was logged at debug. 
> At this time, these changes have not been released.  It would be nice to fix 
> this before releasing 1.7.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-02-01 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella resolved ACCUMULO-4777.
--
Resolution: Fixed

Commits put into branches 1.7, 1.8, and 2.0

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.7.4, 1.9.0, 2.0.0
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> because it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the WAL, that one is closed and a new one is created.  This 
> loop will go forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely Hadoop-related.  However, in Accumulo we 
> probably should not be doing a tight retry loop around a Hadoop failure.  I 
> recommend at a minimum doing some sort of exponential back off and perhaps 
> setting a limit on the number of retries resulting in a critical tserver 
> failure.
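
As a rough sketch of the recommendation in the description above (exponential 
back off plus a hard cap on retries), with invented names rather than the 
actual TabletServerLogger code:

{code:java}
// Illustrative only: retry a WAL write with exponential back off and a
// hard cap, instead of looping forever and rolling a new log each time.
import java.io.IOException;

final class WalRetrySketch {
  private static final int MAX_RETRIES = 10;
  private static final long BASE_BACKOFF_MS = 100;
  private static final long MAX_BACKOFF_MS = 60_000;

  interface WalWrite {
    void run() throws IOException;
  }

  static void writeWithBackoff(WalWrite write)
      throws IOException, InterruptedException {
    for (int attempt = 0;; attempt++) {
      try {
        write.run();
        return; // success
      } catch (IOException e) {
        if (attempt >= MAX_RETRIES) {
          // give up: surfacing a fatal error beats creating thousands of
          // empty WALs and exhausting file handles
          throw e;
        }
        // back off 100ms, 200ms, 400ms, ... capped at one minute
        Thread.sleep(Math.min(MAX_BACKOFF_MS, BASE_BACKOFF_MS << attempt));
      }
    }
  }
}
{code}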



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-26 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella reopened ACCUMULO-4777:
--

We now get the message "Failed to write to WAL, retrying attempt 0" way too 
often because of a normal situation where the log was closed by another thread. 
I am changing the code back to only show this message after the first attempt.
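
A minimal sketch of that change, assuming an slf4j-style logger (names are 
illustrative, not the actual TabletServerLogger code): the expected first 
retry is logged at debug, and only repeated retries escalate to a warning.

{code:java}
// Sketch only, not the actual TabletServerLogger change: the first retry
// after a normal WAL rollover is expected, so it is logged at debug;
// repeated retries are the unusual case that deserves a WARN.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class RetryLogSketch {
  private static final Logger log = LoggerFactory.getLogger(RetryLogSketch.class);

  static void logRetry(int attempt) {
    if (attempt == 0) {
      log.debug("Failed to write to WAL, retrying attempt {}", attempt);
    } else {
      log.warn("Failed to write to WAL, retrying attempt {}", attempt);
    }
  }
}
{code}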

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.7.4, 1.9.0, 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> because it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the WAL, that one is closed and a new one is created.  This 
> loop will go forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely Hadoop-related.  However, in Accumulo we 
> probably should not be doing a tight retry loop around a Hadoop failure.  I 
> recommend at a minimum doing some sort of exponential back off and perhaps 
> setting a limit on the number of retries resulting in a critical tserver 
> failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-25 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340216#comment-16340216
 ] 

Ivan Bella commented on ACCUMULO-4777:
--

I applied the changes to 1.7, merged into 1.8 and subsequently into master.

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.7.4, 1.9.0, 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> because it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the WAL, that one is closed and a new one is created.  This 
> loop will go forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely Hadoop-related.  However, in Accumulo we 
> probably should not be doing a tight retry loop around a Hadoop failure.  I 
> recommend at a minimum doing some sort of exponential back off and perhaps 
> setting a limit on the number of retries resulting in a critical tserver 
> failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-25 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella resolved ACCUMULO-4777.
--
Resolution: Fixed

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.7.4, 1.9.0, 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> because it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the WAL, that one is closed and a new one is created.  This 
> loop will go forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely Hadoop-related.  However, in Accumulo we 
> probably should not be doing a tight retry loop around a Hadoop failure.  I 
> recommend at a minimum doing some sort of exponential back off and perhaps 
> setting a limit on the number of retries resulting in a critical tserver 
> failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ACCUMULO-4783) WAL writes should handle IOExceptions vs other exceptions differently

2018-01-16 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4783:
-
Affects Version/s: 1.7.3
   1.8.1

> WAL writes should handle IOExceptions vs other exceptions differently
> -
>
> Key: ACCUMULO-4783
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4783
> Project: Accumulo
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.3, 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Priority: Major
> Fix For: 2.0.0
>
>
> The writing of WALs in the TabletServerLogger currently does not distinguish 
> between IOExceptions and other exceptions when backing off and retrying.  I 
> believe IOExceptions can be handled by backing off and retrying.  However, 
> other exceptions should perhaps be handled by halting the tserver, as 
> they may denote a coding error instead of a file system error.
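
A small sketch of the proposed distinction, under hypothetical names: back 
off and retry only on IOException, and let anything else propagate so the 
tserver can fail fast.

{code:java}
// Illustrative sketch, not Accumulo's implementation: IOExceptions are
// presumed to be transient filesystem trouble and are retried with back
// off; any other exception propagates immediately, standing in for
// halting the tserver on a probable coding error.
import java.io.IOException;

final class WalErrorPolicy {
  interface WalOp {
    void run() throws Exception;
  }

  static void write(WalOp op, int maxRetries) throws Exception {
    for (int attempt = 0;; attempt++) {
      try {
        op.run();
        return;
      } catch (IOException ioe) {
        if (attempt >= maxRetries) {
          throw ioe; // bounded retry on I/O failures
        }
        // exponential back off, capped at one minute
        Thread.sleep(Math.min(60_000L, 100L << Math.min(attempt, 16)));
      }
      // anything that is not an IOException is not caught here, so it
      // ends the loop and surfaces to the caller at once
    }
  }
}
{code}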



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ACCUMULO-4783) WAL writes should handle IOExceptions vs other exceptions differently

2018-01-16 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4783:
-
Fix Version/s: 2.0.0

> WAL writes should handle IOExceptions vs other exceptions differently
> -
>
> Key: ACCUMULO-4783
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4783
> Project: Accumulo
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.3, 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Priority: Major
> Fix For: 2.0.0
>
>
> The writing of WALs in the TabletServerLogger currently does not distinguish 
> between IOExceptions and other exceptions when backing off and retrying.  I 
> believe IOExceptions can be handled by backing off and retrying.  However, 
> other exceptions should perhaps be handled by halting the tserver, as 
> they may denote a coding error instead of a file system error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ACCUMULO-4783) WAL writes should handle IOExceptions vs other exceptions differently

2018-01-16 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4783:
-
Description: The writing of WALs in the TabletServerLogger currently does 
not distinguish between IOExceptions and other exceptions when backing off and 
retrying.  I believe IOExceptions can be handled by backing off and retrying.  
However, other exceptions should perhaps be handled by halting the 
tserver, as they may denote a coding error instead of a file system error.

> WAL writes should handle IOExceptions vs other exceptions differently
> -
>
> Key: ACCUMULO-4783
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4783
> Project: Accumulo
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 2.0.0
> Environment: The writing of WALs in the TabletServerLogger currently 
> does not distinguish between IOExceptions and other exceptions when backing 
> off and retrying.  I believe IOExceptions can be handled by backing off and 
> retrying.  However, other exceptions should perhaps be handled by 
> halting the tserver, as they may denote a coding error instead of a file 
> system error.
>Reporter: Ivan Bella
>Priority: Major
>
> The writing of WALs in the TabletServerLogger currently does not distinguish 
> between IOExceptions and other exceptions when backing off and retrying.  I 
> believe IOExceptions can be handled by backing off and retrying.  However, 
> other exceptions should perhaps be handled by halting the tserver, as 
> they may denote a coding error instead of a file system error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ACCUMULO-4783) WAL writes should handle IOExceptions vs other exceptions differently

2018-01-16 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4783:
-
Environment: (was: The writing of WALs in the TabletServerLogger 
currently does not distinguish between IOExceptions and other exceptions when 
backing off and retrying.  I believe IOExceptions can be handled by backing off 
and retrying.  However, other exceptions should perhaps be handled by 
halting the tserver, as they may denote a coding error instead of a file 
system error.)

> WAL writes should handle IOExceptions vs other exceptions differently
> -
>
> Key: ACCUMULO-4783
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4783
> Project: Accumulo
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 2.0.0
>Reporter: Ivan Bella
>Priority: Major
>
> The writing of WALs in the TabletServerLogger currently does not distinguish 
> between IOExceptions and other exceptions when backing off and retrying.  I 
> believe IOExceptions can be handled by backing off and retrying.  However, 
> other exceptions should perhaps be handled by halting the tserver, as 
> they may denote a coding error instead of a file system error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ACCUMULO-4783) WAL writes should handle IOExceptions vs other exceptions differently

2018-01-16 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4783:


 Summary: WAL writes should handle IOExceptions vs other exceptions 
differently
 Key: ACCUMULO-4783
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4783
 Project: Accumulo
  Issue Type: Bug
  Components: tserver
Affects Versions: 2.0.0
 Environment: The writing of WALs in the TabletServerLogger currently 
does not distinguish between IOExceptions and other exceptions when backing off 
and retrying.  I believe IOExceptions can be handled by backing off and 
retrying.  However, other exceptions should perhaps be handled by 
halting the tserver, as they may denote a coding error instead of a file 
system error.
Reporter: Ivan Bella






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-16 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4777:
-
Fix Version/s: 1.7.4
   1.8.2

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.7.4, 1.9.0, 2.0.0, 1.8.2
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> because it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the WAL, that one is closed and a new one is created.  This 
> loop will go forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely Hadoop-related.  However, in Accumulo we 
> probably should not be doing a tight retry loop around a Hadoop failure.  I 
> recommend at a minimum doing some sort of exponential back off and perhaps 
> setting a limit on the number of retries resulting in a critical tserver 
> failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-16 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella reassigned ACCUMULO-4777:


Assignee: Ivan Bella

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.9.0, 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> because it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the WAL, that one is closed and a new one is created.  This 
> loop will go forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely Hadoop-related.  However, in Accumulo we 
> probably should not be doing a tight retry loop around a Hadoop failure.  I 
> recommend at a minimum doing some sort of exponential back off and perhaps 
> setting a limit on the number of retries resulting in a critical tserver 
> failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-12 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324162#comment-16324162
 ] 

Ivan Bella commented on ACCUMULO-4777:
--

So if we do an overflow check on that sequence, what would we do?  Depending 
on a continuous sequence anywhere seems like a process destined to eventually 
fail.

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.9.0, 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> because it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the wal, that one is closed and a new one is created.  This 
> loop will run forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop-related.  However, in accumulo we 
> probably should not be doing a tight retry loop around a hadoop failure.  I 
> recommend at a minimum doing some sort of exponential backoff, and perhaps 
> setting a limit on the number of retries, with exhaustion resulting in a 
> critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-12 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324086#comment-16324086
 ] 

Ivan Bella commented on ACCUMULO-4777:
--

[~kturner] I believe the sequence you are referring to is pulled from 
CommitSession.getWALogSeq() which is populated from the nextSeq int in 
TabletMemory.

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.9.0, 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries being added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> that it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the wal, that one is closed and a new one is created.  This 
> loop will run forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop-related.  However, in accumulo we 
> probably should not be doing a tight retry loop around a hadoop failure.  I 
> recommend at a minimum doing some sort of exponential backoff, and perhaps 
> setting a limit on the number of retries, with exhaustion resulting in a 
> critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-12 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324069#comment-16324069
 ] 

Ivan Bella commented on ACCUMULO-4777:
--

BTW, an option here is to implement the backoff mechanism under a separate 
ticket so that we can get the unused sequence generation mechanism removed 
immediately.

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.9.0, 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries being added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> that it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the wal, that one is closed and a new one is created.  This 
> loop will run forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop-related.  However, in accumulo we 
> probably should not be doing a tight retry loop around a hadoop failure.  I 
> recommend at a minimum doing some sort of exponential backoff, and perhaps 
> setting a limit on the number of retries, with exhaustion resulting in a 
> critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-12 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324050#comment-16324050
 ] 

Ivan Bella commented on ACCUMULO-4777:
--

I updated the pull request with a backoff mechanism and termination criteria 
for failed writes to the WALs.  I used a mechanism parallel to the WAL 
creation backoff process.

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.9.0, 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries being added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> that it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the wal, that one is closed and a new one is created.  This 
> loop will run forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop-related.  However, in accumulo we 
> probably should not be doing a tight retry loop around a hadoop failure.  I 
> recommend at a minimum doing some sort of exponential backoff, and perhaps 
> setting a limit on the number of retries, with exhaustion resulting in a 
> critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-11 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323254#comment-16323254
 ] 

Ivan Bella commented on ACCUMULO-4777:
--

[~kturner] As far as I can tell this sequence generator value is not actually 
being used anywhere.  That may have been how it was used in the past, but no 
longer.  I created a pull request that strips that out.

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.8.2, 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries being added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> that it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the wal, that one is closed and a new one is created.  This 
> loop will run forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop-related.  However, in accumulo we 
> probably should not be doing a tight retry loop around a hadoop failure.  I 
> recommend at a minimum doing some sort of exponential backoff, and perhaps 
> setting a limit on the number of retries, with exhaustion resulting in a 
> critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-11 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16322739#comment-16322739
 ] 

Ivan Bella commented on ACCUMULO-4777:
--

After several days of getting my head around this code, I think I figured it 
out.  There is an AtomicInteger used as a sequence counter in the 
TabletServerLogger.  When this sequence counter wraps (goes negative), an 
exception is thrown.  However, in the write method where it is thrown, it will 
subsequently close the current WAL, open a new one, and recursively call itself 
via the defineTablet method.  This underlying call will fail for the same 
reason, and then close the WAL, and recursively call itself again...etc, etc, 
etc.

So basically we have tablet servers that have been up long enough to actually 
incur over 2^31 writes into the WALs.  Once this happens, the server will go 
into this loop.  I am guessing that not many systems leave the tablet servers 
up long enough for this to happen.  Also, this is happening for us on tservers 
to which only the accumulo.metadata table is pinned (via the HostRegexBalancer).  
Hence it is actually more likely to happen first on these tservers.

As far as I can tell, every path to this write method basically ignores the 
sequence number returned.  So what is the real purpose of this sequence 
generator?  I think I need the original authors of this code to tell me.  My 
inclination is to basically reset the sequence generator back to 0 and just 
continue.  Any thoughts out there on this?
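
For illustration only, a sketch of the "reset back to 0 and just continue"
inclination, assuming the sequence values only need to keep increasing between
resets (a hypothetical class, not the actual TabletServerLogger field):

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a sequence counter that wraps back to 0 instead of going negative
// and triggering the exception described above.
public class WrapSafeSequence {
  private final AtomicInteger seq = new AtomicInteger(0);

  public int next() {
    // getAndUpdate applies the function atomically, so concurrent callers
    // never observe a negative sequence value.
    return seq.getAndUpdate(s -> s == Integer.MAX_VALUE ? 0 : s + 1);
  }
}
{code}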

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Priority: Critical
> Fix For: 1.8.2, 2.0.0
>
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries being added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> that it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the wal, that one is closed and a new one is created.  This 
> loop will run forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop-related.  However, in accumulo we 
> probably should not be doing a tight retry loop around a hadoop failure.  I 
> recommend at a minimum doing some sort of exponential backoff, and perhaps 
> setting a limit on the number of retries, with exhaustion resulting in a 
> critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-08 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316356#comment-16316356
 ] 

Ivan Bella edited comment on ACCUMULO-4777 at 1/8/18 2:14 PM:
--

Both times there were stack overflow errors.  The stack overflow is basically as 
follows (accumulo 1.8.1):

All in TabletServerLogger:

defineTablet line 465
write line 382
write line 356
defineTablet line 465
write line 382
write line 356
...



was (Author: ivan.bella):
The stack overflow is basically as follows (accumulo 1.8.1):

All in TabletServerLogger:

defineTablet line 465
write line 382
write line 356
defineTablet line 465
write line 382
write line 356
...


> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Priority: Critical
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries being added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> that it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the wal, that one is closed and a new one is created.  This 
> loop will run forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop-related.  However, in accumulo we 
> probably should not be doing a tight retry loop around a hadoop failure.  I 
> recommend at a minimum doing some sort of exponential backoff, and perhaps 
> setting a limit on the number of retries, with exhaustion resulting in a 
> critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-08 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316337#comment-16316337
 ] 

Ivan Bella edited comment on ACCUMULO-4777 at 1/8/18 2:13 PM:
--

This happened to us again; however, this time everything appeared to recover.  
We had debug on this time, so we are analyzing the logs to try to determine how 
it gets into this state in the first place.


was (Author: ivan.bella):
This happened to us again; however, this time everything appeared to recover.  
This time the loop appeared to terminate with a stack overflow error instead of 
running out of file descriptors first, which may have allowed the tserver to 
remedy the situation earlier.  Also we had debug on so we are analyzing the 
logs to try to determine how it gets into this state in the first place.

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Priority: Critical
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries being added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> that it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the wal, that one is closed and a new one is created.  This 
> loop will run forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop-related.  However, in accumulo we 
> probably should not be doing a tight retry loop around a hadoop failure.  I 
> recommend at a minimum doing some sort of exponential backoff, and perhaps 
> setting a limit on the number of retries, with exhaustion resulting in a 
> critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-08 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316356#comment-16316356
 ] 

Ivan Bella commented on ACCUMULO-4777:
--

The stack overflow is basically as follows (accumulo 1.8.1):

All in TabletServerLogger:

defineTablet line 465
write line 382
write line 356
defineTablet line 465
write line 382
write line 356
...


> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Priority: Critical
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries being added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> that it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the wal, that one is closed and a new one is created.  This 
> loop will run forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop-related.  However, in accumulo we 
> probably should not be doing a tight retry loop around a hadoop failure.  I 
> recommend at a minimum doing some sort of exponential backoff, and perhaps 
> setting a limit on the number of retries, with exhaustion resulting in a 
> critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-08 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316337#comment-16316337
 ] 

Ivan Bella commented on ACCUMULO-4777:
--

This happened to us again; however, this time everything appeared to recover.  
This time the loop appeared to terminate with a stack overflow error instead of 
running out of file descriptors first, which may have allowed the tserver to 
remedy the situation earlier.  Also we had debug on so we are analyzing the 
logs to try to determine how it gets into this state in the first place.

> Root tablet got spammed with 1.8 million log entries
> 
>
> Key: ACCUMULO-4777
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Priority: Critical
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got 
> into a loop where it created over 22K empty wal logs.  There were around 70 
> metadata tablets and this resulted in around 1.8 million log entries being added 
> to the accumulo.root table.  The only reason it stopped creating wal logs is 
> that it ran out of open file handles.  This took us many hours and cups of 
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code there are three places where the 
> TabletServerLogger.close method is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method 
> so this is ruled out
> 2) when the log gets too large or too old, but neither of those checks should 
> have been hitting here.
> 3) In a loop that is executed (while (!success)) in the 
> TabletServerLogger.write method.  In this case, when we unsuccessfully write 
> something to the wal, that one is closed and a new one is created.  This 
> loop will run forever until we successfully write out the entry.  A 
> DfsLogger.LogClosedException seems the most logical reason.  This is most 
> likely because a ClosedChannelException was thrown from the DfsLogger.write 
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop-related.  However, in accumulo we 
> probably should not be doing a tight retry loop around a hadoop failure.  I 
> recommend at a minimum doing some sort of exponential backoff, and perhaps 
> setting a limit on the number of retries, with exhaustion resulting in a 
> critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries

2018-01-05 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4777:


 Summary: Root tablet got spammed with 1.8 million log entries
 Key: ACCUMULO-4777
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
 Project: Accumulo
  Issue Type: Bug
Affects Versions: 1.8.1
Reporter: Ivan Bella
Priority: Critical


We had a tserver that was handling accumulo.metadata tablets that somehow got 
into a loop where it created over 22K empty wal logs.  There were around 70 
metadata tablets and this resulted in around 1.8 million log entries being added 
to the accumulo.root table.  The only reason it stopped creating wal logs is 
that it ran out of open file handles.  This took us many hours and cups of 
coffee to clean up.

The log contained the following messages in a tight loop:

log.TabletServerLogger INFO : Using next log hdfs://...
tserver.TabletServer INFO : Writing log marker for hdfs://...
tserver.TabletServer INFO : Marking hdfs://... closed
log.DfsLogger INFO : Slow sync cost ...
...

Unfortunately we did not have DEBUG turned on so we have no debug messages.

Tracking through the code there are three places where the 
TabletServerLogger.close method is called:
1) via resetLoggers in the TabletServerLogger, but nothing calls this method so 
this is ruled out
2) when the log gets too large or too old, but neither of those checks should 
have been hitting here.
3) In a loop that is executed (while (!success)) in the 
TabletServerLogger.write method.  In this case, when we unsuccessfully write 
something to the wal, that one is closed and a new one is created.  This 
loop will run forever until we successfully write out the entry.  A 
DfsLogger.LogClosedException seems the most logical reason.  This is most 
likely because a ClosedChannelException was thrown from the DfsLogger.write 
methods (around line 609 in DfsLogger).

So the root cause was most likely hadoop-related.  However, in accumulo we 
probably should not be doing a tight retry loop around a hadoop failure.  I 
recommend at a minimum doing some sort of exponential backoff, and perhaps 
setting a limit on the number of retries, with exhaustion resulting in a 
critical tserver failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4749) Need a bulk loading test equivalent to continuous ingest

2017-12-01 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4749:


 Summary: Need a bulk loading test equivalent to continuous ingest
 Key: ACCUMULO-4749
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4749
 Project: Accumulo
  Issue Type: Improvement
  Components: test
Reporter: Ivan Bella


There are some known cases, at least in past versions, where bulk loading may 
fail leaving the ~blip in place but no transaction left to handle it.  This 
will result in directories of files being left around that are not loaded.  We 
should create a continuous ingest variant that uses bulk loading instead.  Then 
if this is run with agitation, the continuous ingest verification can find data 
that has been essentially orphaned.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4748) Verification of continuous ingest with agitation should check for orphaned files

2017-12-01 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4748:


 Summary: Verification of continuous ingest with agitation should 
check for orphaned files
 Key: ACCUMULO-4748
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4748
 Project: Accumulo
  Issue Type: Improvement
  Components: test
Reporter: Ivan Bella


We have found instances where files are being orphaned, but as of writing this 
ticket we have not figured out why.  We should, however, expand the continuous 
ingest verification to check for orphaned files, which we hypothesize will show 
up when agitation is used while running continuous ingest.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4715) Accumulo upgrade path fails from 1.6 to 1.7/1.8 if PerTableVolumeChooser or PreferredTableVolumeChooser is the default

2017-10-04 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4715:


 Summary: Accumulo upgrade path fails from 1.6 to 1.7/1.8 if 
PerTableVolumeChooser or PreferredTableVolumeChooser is the default
 Key: ACCUMULO-4715
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4715
 Project: Accumulo
  Issue Type: Bug
Affects Versions: 1.8.1, 1.8.0, 1.7.3, 1.7.2
Reporter: Ivan Bella
Assignee: Ivan Bella
Priority: Critical
 Fix For: 1.7.4, 1.8.2, 2.0.0


The createReplicationTable method in MetadataTableUtil is used to create the 
replication table when upgrading from 1.6, and it calls the default volume 
chooser to get a location.  The problem is that at that point the replication 
table does not exist, and the PerTableVolumeChooser and 
PreferredTableVolumeChooser classes will call getTableConfiguration with the 
supplied table id, which will return null.  Subsequently those volume choosers 
will throw a NullPointerException and the upgrade will fail.  Both of these 
volume choosers should use their fallbacks instead of failing.
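
A minimal sketch of the proposed fallback behavior, using a stand-in interface
rather than Accumulo's real VolumeChooser API:

{code:java}
import java.util.Random;

// Sketch only: fall back instead of throwing a NullPointerException when the
// table configuration is unavailable. Chooser is a stand-in interface.
interface Chooser {
  String choose(String tableId, String[] options);
}

class PerTableChooserSketch implements Chooser {
  private final Random random = new Random();

  @Override
  public String choose(String tableId, String[] options) {
    Object tableConf = getTableConfiguration(tableId);
    if (tableConf == null) {
      // The table does not exist yet (e.g. the replication table during an
      // upgrade): make a fallback choice instead of dereferencing null.
      return options[random.nextInt(options.length)];
    }
    return options[0]; // placeholder for the table-configured choice
  }

  private Object getTableConfiguration(String tableId) {
    return null; // stand-in: during the upgrade this lookup returns null
  }
}
{code}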



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ACCUMULO-4686) 1.8 upgrade path does not update version on multiple volumes

2017-09-06 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella resolved ACCUMULO-4686.
--
Resolution: Fixed

Commits pushed to 1.7, 1.8, and master.

> 1.8 upgrade path does not update version on multiple volumes
> 
>
> Key: ACCUMULO-4686
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4686
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>Priority: Blocker
> Fix For: 1.7.4, 1.8.2, 2.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The upgrade path from 1.6+ to 1.8 does not update the version across all 
> volumes in a multi-hdfs-volume cluster.  Only one of them gets updated and 
> subsequently all of the tservers will complain and ultimately fail.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4686) 1.8 upgrade path does not update version on multiple volumes

2017-09-01 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151021#comment-16151021
 ] 

Ivan Bella commented on ACCUMULO-4686:
--

Added test case as requested.  I believe this one is good to merge unless there 
are any more objections.

> 1.8 upgrade path does not update version on multiple volumes
> 
>
> Key: ACCUMULO-4686
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4686
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>Priority: Blocker
> Fix For: 1.7.4, 1.8.2, 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The upgrade path from 1.6+ to 1.8 does not update the version across all 
> volumes in a multi-hdfs-volume cluster.  Only one of them gets updated and 
> subsequently all of the tservers will complain and ultimately fail.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4691) HostRegexBalancer not reacting to configuration changes

2017-08-01 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4691:


 Summary: HostRegexBalancer not reacting to configuration changes
 Key: ACCUMULO-4691
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4691
 Project: Accumulo
  Issue Type: Improvement
Affects Versions: 1.8.1
Reporter: Ivan Bella
Assignee: Ivan Bella
 Fix For: 1.8.2, 2.0.0


The HostRegexTableBalancer does not react to changes in the system properties.  
It listens to the TableConfiguration for changes, but not the base accumulo 
properties.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4690) Observable configurations

2017-08-01 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4690:


 Summary: Observable configurations
 Key: ACCUMULO-4690
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4690
 Project: Accumulo
  Issue Type: Improvement
  Components: core
Affects Versions: 2.0.0
Reporter: Ivan Bella
Priority: Minor


Currently only the table and namespace configurations are observable (they 
implement ObservableConfiguration).  In order to appropriately deal with 
system/site configuration changes one needs to use a Timer that periodically 
checks the configuration (see TabletServerResourceManager for example).  It 
would be nice if we created an observable mechanism for those changes as well.
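
A minimal sketch of the Timer-based workaround referred to above, with
illustrative names (this is not the TabletServerResourceManager code):

{code:java}
import java.util.Timer;
import java.util.TimerTask;
import java.util.function.Supplier;

// Sketch: poll a property periodically and invoke a callback when it changes.
public class ConfigPoller {
  private volatile String lastValue;

  public void watch(Supplier<String> property, Runnable onChange) {
    lastValue = property.get();
    Timer timer = new Timer("config-poller", true); // daemon thread
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        String current = property.get();
        if (!current.equals(lastValue)) {
          lastValue = current;
          onChange.run(); // observer-style notification
        }
      }
    }, 30_000, 30_000); // check every 30 seconds
  }
}
{code}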



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ACCUMULO-4686) 1.8 upgrade path does not update version on multiple volumes

2017-07-26 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4686:
-
Fix Version/s: 1.7.4

> 1.8 upgrade path does not update version on multiple volumes
> 
>
> Key: ACCUMULO-4686
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4686
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.4, 1.8.2, 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The upgrade path from 1.6+ to 1.8 does not update the version across all 
> volumes in a multi-hdfs-volume cluster.  Only one of them gets updated and 
> subsequently all of the tservers will complain and ultimately fail.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4686) 1.8 upgrade path does not update version on multiple volumes

2017-07-25 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100385#comment-16100385
 ] 

Ivan Bella commented on ACCUMULO-4686:
--

Ah, figured it out.  The routine is not using the fs-specific version of 
getAccumuloPersistentVersion to determine the version on the volume of interest.
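
Conceptually, the fix is to read and update the version marker through each
volume's own FileSystem.  A sketch under that assumption, with illustrative
method names (not the actual Accumulo upgrade code):

{code:java}
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: walk every volume and update its version marker, asking each
// volume's own FileSystem rather than a single default one.
class UpgradeVersionSketch {
  static final int CURRENT_VERSION = 8; // hypothetical target version

  void upgradeAllVolumes(Map<FileSystem, Path> versionDirs) throws IOException {
    for (Map.Entry<FileSystem, Path> e : versionDirs.entrySet()) {
      FileSystem fs = e.getKey();
      Path versionDir = e.getValue();
      Path marker = new Path(versionDir, Integer.toString(CURRENT_VERSION));
      if (!fs.exists(marker)) {
        fs.create(marker, false).close(); // write this volume's new marker
      }
    }
  }
}
{code}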

> 1.8 upgrade path does not update version on multiple volumes
> 
>
> Key: ACCUMULO-4686
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4686
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.8.2, 2.0.0
>
>
> The upgrade path from 1.6+ to 1.8 does not update the version across all 
> volumes in a multi-hdfs-volume cluster.  Only one of them gets updated and 
> subsequently all of the tservers will complain and ultimately fail.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4686) 1.8 upgrade path does not update version on multiple volumes

2017-07-25 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100377#comment-16100377
 ] 

Ivan Bella commented on ACCUMULO-4686:
--

The code suggests that this is done correctly (Accumulo.upgradeAccumuloVersion); 
however, we definitely had an instance where it failed.  Doing more 
investigation as to why.

> 1.8 upgrade path does not update version on multiple volumes
> 
>
> Key: ACCUMULO-4686
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4686
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.8.2, 2.0.0
>
>
> The upgrade path from 1.6+ to 1.8 does not update the version across all 
> volumes in a multi-hdfs-volume cluster.  Only one of them gets updated and 
> subsequently all of the tservers will complain and ultimately fail.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4686) 1.8 upgrade path does not update version on multiple volumes

2017-07-25 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4686:


 Summary: 1.8 upgrade path does not update version on multiple 
volumes
 Key: ACCUMULO-4686
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4686
 Project: Accumulo
  Issue Type: Bug
  Components: master
Affects Versions: 1.8.1
Reporter: Ivan Bella
Assignee: Ivan Bella
 Fix For: 1.8.2, 2.0.0


The upgrade path from 1.6+ to 1.8 does not update the version across all 
volumes in a multi-hdfs-volume cluster.  Only one of them gets updated and 
subsequently all of the tservers will complain and ultimately fail.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4086) Allow configurable failsafe volume choosing

2017-07-24 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099197#comment-16099197
 ] 

Ivan Bella commented on ACCUMULO-4086:
--

I need this change for deployment onto our test system.  Can we attempt to get 
this one wrapped up?  I see there are now conflicts that have crept in.  I will 
attempt to resolve those with another pull request into your branch.

> Allow configurable failsafe volume choosing
> ---
>
> Key: ACCUMULO-4086
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4086
> Project: Accumulo
>  Issue Type: Sub-task
>  Components: core
>Reporter: Christopher Tubbs
>Assignee: Christopher Tubbs
> Fix For: 2.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> From parent issue:
> {quote}
> 3. In several places including {{PreferredVolumeChooser}}, 
> {{PerTableVolumeChooser}} and {{VolumeManagerImpl}}, the failsafe chooser is 
> the {{RandomVolumeChooser}} which will include the instance volume that needs 
> to be excluded.  It would be useful to have a configurable failsafe in this 
> situation.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ACCUMULO-4667) LocalityGroupIterator very inefficient with large locality groups

2017-07-23 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella resolved ACCUMULO-4667.
--
Resolution: Fixed

> LocalityGroupIterator very inefficient with large locality groups
> -
>
> Key: ACCUMULO-4667
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4667
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.6.6, 1.7.3, 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.8.2, 2.0.0
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> On one of our systems we tracked some scans that were taking an extremely 
> long time to complete (many hours).  As it turns out the scan was relatively 
> simple in that it was scanning a tablet for all keys that had a specific 
> column family.  Note that there was very little data that actually matched 
> this column family.  Upon tracing the code we found that it was spending a 
> large amount of time in the LocalityGroupIterator.  Stack traces continually 
> found the code to be at line 128 or 129 of the LocalityGroupIterator.  Those 
> line numbers are consistent from the 1.6 series all the way to 2.0.0 
> (master).  In this case the column family being searched for was included in 
> one of a dozen or so locality groups on that table, and the locality group 
> itself had 40 or so column families.  We see several things that can be done 
> here:
> 1) The code that checks the group column families against those being 
> searched for can quickly exit once it finds a match
> 2) The code that checks the group column families against those being 
> searched for can look at the relative size of those two groups and invert the 
> logic appropriately for a more efficient loop.
> 3) We could create a cached map of column families to locality groups 
> allowing us to avoid examining each locality group every time we seek.
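
A minimal sketch of improvements 1 and 2 from the list above, in generic Java
(not the actual LocalityGroupIterator code):

{code:java}
import java.util.Set;

// Sketch: iterate over the smaller of the two column family sets and exit
// as soon as any match is found.
class GroupMatchSketch {
  static boolean groupMatches(Set<String> groupFamilies, Set<String> requestedFamilies) {
    // improvement 2: loop over the smaller set and probe the larger one
    Set<String> smaller =
        groupFamilies.size() <= requestedFamilies.size() ? groupFamilies : requestedFamilies;
    Set<String> larger = (smaller == groupFamilies) ? requestedFamilies : groupFamilies;
    for (String cf : smaller) {
      if (larger.contains(cf)) {
        return true; // improvement 1: quick exit on the first match
      }
    }
    return false;
  }
}
{code}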



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ACCUMULO-4655) monitor should show response time in addition to last contact

2017-07-23 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella resolved ACCUMULO-4655.
--
Resolution: Fixed

> monitor should show response time in addition to last contact
> -
>
> Key: ACCUMULO-4655
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4655
> Project: Accumulo
>  Issue Type: New Feature
>  Components: monitor
>Affects Versions: 1.6.6, 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.8.2, 2.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> The monitor has a column for last contact.  Sometimes that is due to a 
> tserver not responding.  If we show the response time for a tserver to return 
> its status, then we would have the appropriate info.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-07-20 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella resolved ACCUMULO-4643.
--
Resolution: Fixed

Pull requests have been merged.  I also did a post-merge from 1.8.2 into 2.0.0 
to ensure that it was clean.

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 1.8.2, 2.0.0
>
>  Time Spent: 18h 20m
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and causes the remaining ranges to be 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).
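
A minimal sketch of option 2 from the description, with a stand-in key type
(illustrative only; not the API that was actually merged):

{code:java}
// Sketch: a special exception carrying the iterator's current position so the
// tablet server can tear the scan down and mark the remaining range unfinished.
public class ScanInterruptedException extends RuntimeException {
  private final String currentKey; // stand-in for an Accumulo Key

  public ScanInterruptedException(String currentKey) {
    super("iterator yielded at " + currentKey);
    this.currentKey = currentKey;
  }

  public String getCurrentKey() {
    return currentKey; // used to resume the scan from this position later
  }
}
{code}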



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-07-20 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4643:
-
Fix Version/s: 1.8.2

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 1.8.2, 2.0.0
>
>  Time Spent: 18h 20m
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and causes the remaining ranges to be 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-07-20 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095185#comment-16095185
 ] 

Ivan Bella commented on ACCUMULO-4643:
--

I was working to ensure that the merge from 1.8 into master was clean and did 
not undo anything.  I will close things up now.

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 2.0.0
>
>  Time Spent: 18h 20m
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and causes the remaining ranges to be 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ACCUMULO-4654) HostRegex balancer should allow migration even when we have pending migrations

2017-07-19 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella resolved ACCUMULO-4654.
--
Resolution: Fixed

The HostRegex balancer will now allow additional migrations even when current 
migrations exist, up to the configured limit specified by 
table.custom.balancer.host.regex.max.outstanding.migrations (default 0).
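
For example, the limit could be raised from the Accumulo shell (the property
name comes from the comment above; the value 4 is arbitrary):

{noformat}
root@instance> config -s table.custom.balancer.host.regex.max.outstanding.migrations=4
{noformat}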

> HostRegex balancer should allow migration even when we have pending migrations
> --
>
> Key: ACCUMULO-4654
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4654
> Project: Accumulo
>  Issue Type: New Feature
>  Components: master
>Affects Versions: 1.6.6, 1.8.1
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: balancer, master, migration
> Fix For: 1.8.2, 2.0.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> The HostRegexTableBalancer currently halts all migrations when there are 
> pending migrations.  I propose fixing this to allow adding additional 
> migrations even when there are pending migrations, up to a specified amount.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ACCUMULO-4043) start-up needs to scale with the number of servers

2017-06-29 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella reassigned ACCUMULO-4043:


Assignee: (was: Ivan Bella)

> start-up needs to scale with the number of servers
> --
>
> Key: ACCUMULO-4043
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4043
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.6.4
> Environment: very large production cluster
>Reporter: Eric Newton
> Fix For: 2.0.0
>
>
> When starting a very large production cluster, the master gets running fast 
> enough that it loads the metadata table before all the tservers are 
> registered and running. The result is a very long balancing period after all 
> servers have started. The wait period for tablet server stabilization needs 
> to scale somewhat with the number of tablet servers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4667) LocalityGroupIterator very inefficient with large locality groups

2017-06-27 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065362#comment-16065362
 ] 

Ivan Bella commented on ACCUMULO-4667:
--

[~kturner] You are correct.  I believe that is what the count is used for in 
the map passed into the seek call.  I will use that to pre-filter the locality 
groups as is currently being done in the seek.

> LocalityGroupIterator very inefficient with large locality groups
> -
>
> Key: ACCUMULO-4667
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4667
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.6.6, 1.7.3, 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.8.2, 2.0.0
>
>
> On one of our systems we tracked some scans that were taking an extremely 
> long time to complete (many hours).  As it turns out the scan was relatively 
> simple in that it was scanning a tablet for all keys that had a specific 
> column family.  Note that there was very little data that actually matched 
> this column family.  Upon tracing the code we found that it was spending a 
> large amount of time in the LocalityGroupIterator.  Stack traces continually 
> found the code to be at line 128 or 129 of the LocalityGroupIterator.  Those 
> line numbers are consistent from the 1.6 series all the way to 2.0.0 
> (master).  In this case the column family being searched for was included in 
> one of a dozen or so locality groups on that table, and the locality group 
> itself had 40 or so column families.  We see several things that can be done 
> here:
> 1) The code that checks the group column families against those being 
> searched for can quickly exit once it finds a match
> 2) The code that checks the group column families against those being 
> searched for can look at the relative size of those two groups and invert the 
> logic appropriately for a more efficient loop.
> 3) We could create a cached map of column families to locality groups 
> allowing us to avoid examining each locality group every time we seek.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ACCUMULO-4667) LocalityGroupIterator very inefficient with large locality groups

2017-06-27 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4667:
-
Description: 
On one of our systems we tracked some scans that were taking an extremely long 
time to complete (many hours).  As it turns out the scan was relatively simple 
in that it was scanning a tablet for all keys that had a specific column 
family.  Note that there was very little data that actually matched this column 
family.  Upon tracing the code we found that it was spending a large amount of 
time in the LocalityGroupIterator.  Stack traces continually found the code to 
be at line 128 or 129 of the LocalityGroupIterator.  Those line numbers are 
consistent from the 1.6 series all the way to 2.0.0 (master).  In this case the 
column family being searched for was included in one of a dozen or so locality 
groups on that table, and the locality group itself had 40 or so column 
families.  We see several things that can be done here:

1) The code that checks the group column families against those being searched 
for can quickly exit once it finds a match
2) The code that checks the group column families against those being searched 
for can look at the relative size of those two groups and invert the logic 
appropriately for a more efficient loop.
3) We could create a cached map of column families to locality groups allowing 
us to avoid examining each locality group every time we seek.

  was:
On one of our systems we tracked some scans that were taking an extremely long 
time to complete (many hours).  As it turns out the scan was relatively simple 
in that it was scanning a tablet for all keys that had a specific column 
family.  Note that there was very little data that actually matched this column 
family.  Upon tracing the code we found that it was spending a large amount of 
time in the LocalityGroupIterator.  Stack traces continually found the code to 
be at list 128 or 129 of the LocalityGroupIterator.  Those line numbers are 
consistent from the 1.6 series all the way to 2.0.0 (master).  In this case the 
column family being searched for was included in one of a dozen or so locality 
groups on that table, and the locality group itself had 40 or so column 
families.  We see several things that can be done here:

1) The code that checks the group column families against those being searched 
for can quickly exit once it finds a match
2) The code that checks the group column families against those being searched 
for can look at the relative size of those two groups and invert the logic 
appropriately for a more efficient loop.
3) We could create a cached map of column families to locality groups allowing 
us to avoid examining each locality group every time we seek.


> LocalityGroupIterator very inefficient with large locality groups
> -
>
> Key: ACCUMULO-4667
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4667
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.6.6, 1.7.3, 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.8.2, 2.0.0
>
>
> On one of our systems we tracked some scans that were taking an extremely 
> long time to complete (many hours).  As it turns out the scan was relatively 
> simple in that it was scanning a tablet for all keys that had a specific 
> column family.  Note that there was very little data that actually matched 
> this column family.  Upon tracing the code we found that it was spending a 
> large amount of time in the LocalityGroupIterator.  Stack traces continually 
> found the code to be at line 128 or 129 of the LocalityGroupIterator.  Those 
> line numbers are consistent from the 1.6 series all the way to 2.0.0 
> (master).  In this case the column family being searched for was included in 
> one of a dozen or so locality groups on that table, and the locality group 
> itself had 40 or so column families.  We see several things that can be done 
> here:
> 1) The code that checks the group column families against those being 
> searched for can quickly exit once it finds a match
> 2) The code that checks the group column families against those being 
> searched for can look at the relative size of those two groups and invert the 
> logic appropriately for a more efficient loop.
> 3) We could create a cached map of column families to locality groups 
> allowing us to avoid examining each locality group every time we seek.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4667) LocalityGroupIterator very inefficient with large locality groups

2017-06-27 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4667:


 Summary: LocalityGroupIterator very inefficient with large 
locality groups
 Key: ACCUMULO-4667
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4667
 Project: Accumulo
  Issue Type: Improvement
  Components: tserver
Affects Versions: 1.8.1, 1.7.3, 1.6.6, 2.0.0
Reporter: Ivan Bella
Assignee: Ivan Bella
 Fix For: 1.8.2, 2.0.0


On one of our systems we tracked some scans that were taking an extremely long 
time to complete (many hours).  As it turns out the scan was relatively simple 
in that it was scanning a tablet for all keys that had a specific column 
family.  Note that there was very little data that actually matched this column 
family.  Upon tracing the code we found that it was spending a large amount of 
time in the LocalityGroupIterator.  Stack traces continually found the code to 
be at list 128 or 129 of the LocalityGroupIterator.  Those line numbers are 
consistent from the 1.6 series all the way to 2.0.0 (master).  In this case the 
column family being searched for was included in one of a dozen or so locality 
groups on that table, and the locality group itself had 40 or so column 
families.  We see several things that can be done here:

1) The code that checks the group column families against those being searched 
for can quickly exit once it finds a match
2) The code that checks the group column families against those being searched 
for can look at the relative size of those two groups and invert the logic 
appropriately for a more efficient loop.
3) We could create a cached map of column families to locality groups allowing 
us to avoid examining each locality group every time we seek.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4661) Balancing should be done outside of status thread

2017-06-21 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4661:


 Summary: Balancing should be done outside of status thread
 Key: ACCUMULO-4661
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4661
 Project: Accumulo
  Issue Type: Improvement
  Components: master
Affects Versions: 1.8.1
Reporter: Ivan Bella
Assignee: Ivan Bella
 Fix For: 1.8.2, 2.0.0


Balancing can sometimes take a long time, which results in stale tserver 
status as seen in the monitor.  I suggest running balancing in a separate 
thread, allowing the status thread to continue at a regular interval.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4086) Allow configurable failsafe volume choosing

2017-06-16 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052102#comment-16052102
 ] 

Ivan Bella commented on ACCUMULO-4086:
--

Are we ready to merge this one in?

> Allow configurable failsafe volume choosing
> ---
>
> Key: ACCUMULO-4086
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4086
> Project: Accumulo
>  Issue Type: Sub-task
>  Components: core
>Reporter: Christopher Tubbs
>Assignee: Christopher Tubbs
> Fix For: 2.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> From parent issue:
> {quote}
> 3. In several places including {{PreferredVolumeChooser}}, 
> {{PerTableVolumeChooser}} and {{VolumeManagerImpl}}, the failsafe chooser is 
> the {{RandomVolumeChooser}} which will include the instance volume that needs 
> to be excluded.  It would be useful to have a configurable failsafe in this 
> situation.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4655) monitor should show response time in addition to last contact

2017-06-16 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4655:


 Summary: monitor should show response time in addition to last 
contact
 Key: ACCUMULO-4655
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4655
 Project: Accumulo
  Issue Type: New Feature
  Components: monitor
Affects Versions: 1.8.1, 1.6.6
Reporter: Ivan Bella
Assignee: Ivan Bella
 Fix For: 1.8.2, 2.0.0


The monitor has a column for last contact.  Sometimes a high value there is 
due to a tserver not responding.  If we also show the time it takes a tserver 
to return its status, then we would have the appropriate info.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ACCUMULO-4654) HostRegex balancer should allow migration even when we have pending migrations

2017-06-16 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4654:


 Summary: HostRegex balancer should allow migration even when we 
have pending migrations
 Key: ACCUMULO-4654
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4654
 Project: Accumulo
  Issue Type: New Feature
  Components: master
Affects Versions: 1.8.1, 1.6.6
Reporter: Ivan Bella
Assignee: Ivan Bella
 Fix For: 1.8.2, 2.0.0


The HostRegexTableLoadBalancer currently halts all migrations when there are 
pending migrations.  I propose fixing this to allow additional migrations to 
be added, up to a specified limit, even when some migrations are still pending.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-06-10 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045563#comment-16045563
 ] 

Ivan Bella commented on ACCUMULO-4643:
--

[~kturner] It is interesting in test 1 that the standard deviation is so high.  
With 128 threads and 127 scans, I would have expected all of the scans to get a 
fair amount of processing time.  I am wondering if the max open files is 
causing issues there.  That might explain why the mean and standard deviation 
are lower in test 2, since the files get closed when one scan yields.  These 
are the types of results I would expect when the number of tserver threads is 
smaller than the number of long-running scans, and exactly what I would want 
to see.
I am not sure what to do with test 3.  I am guessing there is a test 4 coming 
which is comparable to test 3 but with yielding.

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 2.0.0
>
>  Time Spent: 17h 10m
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and results in the remaining ranges being 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-06-02 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4643:
-
Fix Version/s: 1.8.2

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 1.8.2, 2.0.0
>
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and results in the remaining ranges being 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-06-02 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034793#comment-16034793
 ] 

Ivan Bella commented on ACCUMULO-4643:
--

I added a second pull request which backports this to 1.8.

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 1.8.2, 2.0.0
>
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and results in the remaining ranges being 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-05-31 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032011#comment-16032011
 ] 

Ivan Bella commented on ACCUMULO-4643:
--

Changed the API to be enableYielding(YieldCallback) instead of separate methods 
on the SKVI.

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 2.0.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and results in the remaining ranges being 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-05-31 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031804#comment-16031804
 ] 

Ivan Bella commented on ACCUMULO-4643:
--

I reworked the pull request to use interface methods instead of throwing 
exceptions. This also ensures that the yielding mechanism will not interrupt 
when using row isolation mode.

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 2.0.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and results in the remaining ranges being 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-05-30 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030078#comment-16030078
 ] 

Ivan Bella commented on ACCUMULO-4643:
--

I will change the implementation as such in the next few days.

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 2.0.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and results in the remaining ranges being 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-05-30 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030077#comment-16030077
 ] 

Ivan Bella commented on ACCUMULO-4643:
--

After a lengthy discussion with [~kturner], we came to the conclusion that if 
this mechanism is put in place then iterators that implement a yielding 
mechanism can only be called from iterators that are aware of that fact.  Hence 
a separate interface (or perhaps changing SKVI) that has methods something like 
the following would be better:

{code}
// Called to tell an iterator that yielding is supported.
public void setYieldSupported();
// Called after every next and seek call to determine if it has yielded
public boolean hasYielded();
// Called after hasYielded returns true to determine the key to re-seek to later
public Key getYieldPosition();
{code}

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 2.0.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and results in the remaining ranges being 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-05-30 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029882#comment-16029882
 ] 

Ivan Bella commented on ACCUMULO-4643:
--

[~kturner] The behaviour is expected to be identical to when an iterator is 
torn down because the buffer is filled, or the timeout is reached.

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 2.0.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and results in the remaining ranges being 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-05-30 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029865#comment-16029865
 ] 

Ivan Bella commented on ACCUMULO-4643:
--

[~kturner] You are correct in that whatever key gets sent back with the 
exception (or the "interrupt" key) would have to be transformed.  In fact it 
would need to contain enough information such that the subsequent seek 
invocation (using that key as the start of the range, non-inclusive) will allow 
the seek to continue where it left off.  Option #2 seemed simpler to implement 
initially.  I am going to implement an option #3 where I create an interface 
that can be used to extend an SKVI with methods to implement this concept 
instead of using an exception.  Option #1 seemed like a potential performance 
problem, as we would need to check the class of the returned key on every 
cycle.  It is thought that yielding would only be implemented in iterators 
that may do a lot of work for no gain, and hence this would happen 
infrequently.

> Allow iterators to interrupt themselves
> ---
>
> Key: ACCUMULO-4643
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.8.1, 2.0.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: features
> Fix For: 2.0.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> The idea here is to allow an iterator stack to send back a special key or 
> throw a special exception which will allow the tablet server to tear down the 
> scan to be rebuilt later.  This is to handle the case where an iterator is 
> doing a lot of work without returning results to avoid starving out other 
> scans.
> There are two thoughts on how to do this:
> 1) A special "interrupt" key is returned from the getTopKey call that is 
> detected in the Tablet.nextBatch call, is not added to the results, but is 
> used to add an unfinished range and results in the remaining ranges being 
> deemed unfinished.
> 2) A special exception is thrown from the next or seek call that includes 
> the key of the current position, and the same actions are taken as in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (ACCUMULO-4643) Allow iterators to interrupt themselves

2017-05-25 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4643:


 Summary: Allow iterators to interrupt themselves
 Key: ACCUMULO-4643
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4643
 Project: Accumulo
  Issue Type: Improvement
  Components: tserver
Affects Versions: 1.8.1, 2.0.0
Reporter: Ivan Bella
Assignee: Ivan Bella
 Fix For: 2.0.0


The idea here is to allow an iterator stack to send back a special key or throw 
a special exception which will allow the tablet server to tear down the scan to 
be rebuilt later.  This is to handle the case where an iterator is doing a lot 
of work without returning results to avoid starving out other scans.

There are two thoughts on how to do this:
1) A special "interrupt" key is returned from the getTopKey call that is 
detected in the Tablet.nextBatch call, is not added to the results, but is used 
to add an unfinished range and results in the remaining ranges being deemed 
unfinished.
2) A special exception is thrown from the next or seek call that includes the 
key of the current position, and the same actions are taken as in 1).




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (ACCUMULO-4074) create user-configurable resource pools for different kinds of requests

2017-05-23 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella reassigned ACCUMULO-4074:


Assignee: Keith Turner

> create user-configurable resource pools for different kinds of requests
> ---
>
> Key: ACCUMULO-4074
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4074
> Project: Accumulo
>  Issue Type: Improvement
>  Components: client, tserver
>Reporter: Eric Newton
>Assignee: Keith Turner
>
> Complex queries and iterator stacks can sometimes run for long periods of 
> time.  During that time, access to resources for shorter, simpler lookups can 
> be blocked.  Use separate resource pools to allow simpler queries to run 
> regardless.  This same mechanism could be used for the metadata 
> and root tables, too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (ACCUMULO-4634) MockIteratorEnvironment implementation broken in mock accumulo framework

2017-05-15 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4634:
-
Status: Open  (was: Patch Available)

> MockIteratorEnvironment implementation broken in mock accumulo framework
> 
>
> Key: ACCUMULO-4634
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4634
> Project: Accumulo
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.8.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: mock
> Fix For: 1.8.2, 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The MockIteratorEnvironment was broken when the additional sampling methods 
> were added.  The isSamplingEnabled throws an unsupported operation exception 
> instead of simply returning false.  Also the getSamplerConfiguration should 
> return null and the cloneWithSamplingEnabled should throw a 
> SampleNotPresentException instead of an UnsupportedOperationException, per the 
> documentation in the IteratorEnvironment interface.  This will allow its use 
> to work smoothly as before.  Granted, this whole MockAccumulo mechanism is 
> deprecated; however, external projects still depend on it, and hence 
> we should not have broken its implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (ACCUMULO-4634) MockIteratorEnvironment implementation broken in mock accumulo framework

2017-05-15 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella updated ACCUMULO-4634:
-
Status: Patch Available  (was: In Progress)

> MockIteratorEnvironment implementation broken in mock accumulo framework
> 
>
> Key: ACCUMULO-4634
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4634
> Project: Accumulo
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.8.0
>Reporter: Ivan Bella
>Assignee: Ivan Bella
>  Labels: mock
> Fix For: 1.8.2, 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The MockIteratorEnvironment was broken when the additional sampling methods 
> were added.  The isSamplingEnabled throws an unsupported operation exception 
> instead of simply returning false.  Also the getSamplerConfiguration should 
> return null and the cloneWithSamplingEnabled should throw a 
> SampleNotPresentException instead of an UnsupportedOperationException, per the 
> documentation in the IteratorEnvironment interface.  This will allow its use 
> to work smoothly as before.  Granted, this whole MockAccumulo mechanism is 
> deprecated; however, external projects still depend on it, and hence 
> we should not have broken its implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (ACCUMULO-4634) MockIteratorEnvironment implementation broken in mock accumulo framework

2017-05-03 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4634:


 Summary: MockIteratorEnvironment implementation broken in mock 
accumulo framework
 Key: ACCUMULO-4634
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4634
 Project: Accumulo
  Issue Type: Bug
  Components: core
Affects Versions: 1.8.0
Reporter: Ivan Bella
Assignee: Ivan Bella
 Fix For: 1.8.2, 2.0.0


The MockIteratorEnvironment was broken when the additional sampling methods 
were added.  The isSamplingEnabled throws an unsupported operation exception 
instead of simply returning false.  Also the getSamplerConfiguration should 
return null and the cloneWithSamplingEnabled should throw a 
SampleNotPresentException instead of an UnsupportedOperationException, per the 
documentation in the IteratorEnvironment interface.  This will allow its use to 
work smoothly as before.  Granted, this whole MockAccumulo mechanism is 
deprecated; however, external projects still depend on it, and hence we should 
not have broken its implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (ACCUMULO-835) not balancing because tablets are offline

2017-02-15 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella reassigned ACCUMULO-835:
---

Assignee: Ivan Bella

> not balancing because tablets are offline
> -
>
> Key: ACCUMULO-835
> URL: https://issues.apache.org/jira/browse/ACCUMULO-835
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Reporter: Eric Newton
>Assignee: Ivan Bella
>Priority: Minor
>
> I started a large merge and chopping took a long time.  I added a new server, 
> and it didn't get any new tablets because the merge had several tablets 
> offline.  There needs to be a better heuristic for balancing even if a few 
> tablets are offline.  Perhaps run balancing if the imbalance is greater than 
> the number of offline tablets and there are no migrations underway?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (ACCUMULO-4043) start-up needs to scale with the number of servers

2017-02-15 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella reassigned ACCUMULO-4043:


Assignee: Ivan Bella

> start-up needs to scale with the number of servers
> --
>
> Key: ACCUMULO-4043
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4043
> Project: Accumulo
>  Issue Type: Bug
>  Components: scripts
>Affects Versions: 1.6.4
> Environment: very large production cluster
>Reporter: Eric Newton
>Assignee: Ivan Bella
> Fix For: 1.8.2, 2.0.0
>
>
> When starting a very large production cluster, the master starts up fast 
> enough that it loads the metadata table before all the tservers are 
> registered and running. The result is a very long balancing period after all 
> servers have started. The wait period for tablet server stabilization needs 
> to scale somewhat with the number of tablet servers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (ACCUMULO-3700) Table Balancer support for multi-tenancy

2017-02-15 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella resolved ACCUMULO-3700.
--
Resolution: Duplicate

> Table Balancer support for multi-tenancy
> 
>
> Key: ACCUMULO-3700
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3700
> Project: Accumulo
>  Issue Type: New Feature
>Affects Versions: 1.6.2
>Reporter: Mike Fagan
>Priority: Minor
>  Labels: balancer, multi-tenant
>
> In a multi-tenant environment I would like to be able to segregate tables to 
> a subset of tablet servers for processing isolation and maintaining SLAs.
> My thinking is this could be accomplished by defining a configuration that 
> maps namespaces to tablet servers; a custom balancer would then use this 
> configuration to balance tablets only on the servers associated with their 
> namespace.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-3700) Table Balancer support for multi-tenancy

2017-02-15 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868589#comment-15868589
 ] 

Ivan Bella commented on ACCUMULO-3700:
--

The HostRegexTableLoadBalancer implements this feature.

> Table Balancer support for multi-tenancy
> 
>
> Key: ACCUMULO-3700
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3700
> Project: Accumulo
>  Issue Type: New Feature
>Affects Versions: 1.6.2
>Reporter: Mike Fagan
>Priority: Minor
>  Labels: balancer, multi-tenant
>
> In a multi-tenant environment I would like to be able to segregate tables to 
> a subset of tablet servers for processing isolation and maintaining SLAs.
> My thinking is this could be accomplished by defining a configuration that 
> maps namespaces to tablet servers; a custom balancer would then use this 
> configuration to balance tablets only on the servers associated with their 
> namespace.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-835) not balancing because tablets are offline

2017-02-09 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859684#comment-15859684
 ] 

Ivan Bella commented on ACCUMULO-835:
-

Another similar scenario is when long-running major compactions are holding up 
migrations, and consequently all balancing activity is halted until they 
complete.  This scenario is not uncommon for us on our larger clusters.  The 
workaround is to fail the compactions and bounce the tservers that are holding 
up the migrations.

> not balancing because tablets are offline
> -
>
> Key: ACCUMULO-835
> URL: https://issues.apache.org/jira/browse/ACCUMULO-835
> Project: Accumulo
>  Issue Type: Bug
>  Components: master
>Reporter: Eric Newton
>Priority: Minor
>
> I started a large merge and chopping took a long time.  I added a new server, 
> and it didn't get any new tablets because the merge had several tablets 
> offline.  There needs to be a better heuristic for balancing even if a few 
> tablets are offline.  Perhaps run balancing if the imbalance is greater than 
> the number of offline tablets and there are no migrations underway?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2017-02-03 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851995#comment-15851995
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

OK, I will get my head into this one again and think about some appropriate 
logging to improve debuggability.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.2, 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2017-02-03 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851813#comment-15851813
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

I have not seen this occur in quite some time, nor have I been able to 
reproduce this issue.  I am leaning towards closing this as "cannot 
reproduce".  Thoughts?

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.2, 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2016-11-07 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644610#comment-15644610
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

I will give you a better answer, Josh, when I can get back to a computer.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2016-11-07 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644586#comment-15644586
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

OK, I believe you are correct.  The seeks, as well as the readNext, are 
synchronized in the SourceSwitchingIterator, so volatile should not help.  I 
will decline the request and keep searching.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2016-11-07 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644371#comment-15644371
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

The switch and subsequent pointer assignment are done on one thread, and the 
pointer is read on another.  This is the only explanation I can come up with, 
unless the dumped file is somehow not identical to the original memory file.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2016-11-07 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644068#comment-15644068
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

I think you said it when you said "re-initialized to use the local memdump 
file and then re-seek()'ed."  It is the seek call that results in everything 
getting set up.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2016-11-04 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637584#comment-15637584
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

I can guarantee this does not break anything, but I cannot yet guarantee it 
fixes the problem.  It might take me a little while to get it onto a system 
that has shown this issue, and then a while before I can say it has fixed the 
issue.  I need the community to verify that my logic is sound and that this is 
a possible hole.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2016-11-04 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637558#comment-15637558
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

The switchSource is definitely done on a separate thread during minor 
compaction.  So the answer would be yes.  Access is synchronized appropriately 
in the SourceSwitchingIterator, but this would result in one thread setting a 
member variable and another one using it immediately afterwards.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2016-11-04 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637382#comment-15637382
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

After an exhaustive search through this code, the only reason I can come up 
with is an issue with a member reference in the underlying HeapIterator. The 
scenario would be as follows:

thread 1: calls hasTop(), which is invoked on the in-memory data source
thread 2: calls switchSource, which sets up the new data source and calls 
seek, which results in topIdx being set
thread 1: calls next(), which gets a value of null for topIdx

So if topIdx is marked as volatile then this should not happen.


> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2016-10-27 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612927#comment-15612927
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

The SourceSwitchingIterator logic seems to be sound.  The only thing I can now 
come up with is that the seek executed after switching the underlying data 
source somehow did not find a key that it originally found in the previous 
source iterator.  That is actually a scary proposition, so I hope it is wrong.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2016-10-27 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612718#comment-15612718
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

I believe my hypothesis is not correct.  The switchNow method (which is the 
only one that calls _switchNow) synchronizes on the local "copies" member, 
which is shared across all deep copies.  The next method also synchronizes on 
this member, and hence the seek should be performed correctly before the next 
method is actually invoked.  Still searching for the hole.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (ACCUMULO-4502) Called next when there is no top

2016-10-27 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella reassigned ACCUMULO-4502:


Assignee: Ivan Bella

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (ACCUMULO-4502) Called next when there is no top

2016-10-24 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602717#comment-15602717
 ] 

Ivan Bella edited comment on ACCUMULO-4502 at 10/24/16 6:58 PM:


The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.
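To make that interleaving concrete, here is a toy model using assumed names 
(ToySource, RaceDemo), not the actual Accumulo classes: the switch installs a 
fresh, un-seeked source first and only then re-seeks it, so a concurrent 
hasTop()/next() pair can straddle the switch.

{code}
// Toy model of the suspected race (hypothetical names; not the Accumulo code).
public class RaceDemo {

  static class ToySource {
    private volatile boolean seeked;
    private volatile boolean top;

    void seek() { seeked = true; top = true; }

    boolean hasTop() { return seeked && top; }

    void next() {
      if (!hasTop())
        throw new IllegalStateException("Called next() when there is no top");
      top = false;
    }
  }

  private volatile ToySource source = new ToySource();

  ToySource getSource() { return source; }

  // Models _switchNow(): the new source is installed (1) before it is
  // re-seeked (2), leaving a window where readers see an un-seeked source.
  void switchNow() {
    ToySource fresh = new ToySource();
    source = fresh;  // (1)
    fresh.seek();    // (2)
  }

  // Models the consume() loop: hasTop() may be answered by the old source
  // while next() lands on the new, un-seeked one.
  void consume() {
    while (getSource().hasTop())
      getSource().next();
  }

  public static void main(String[] args) throws InterruptedException {
    RaceDemo demo = new RaceDemo();
    demo.getSource().seek();
    Thread switcher = new Thread(() -> {
      for (int i = 0; i < 1_000_000; i++) demo.switchNow();
    });
    switcher.start();
    for (int i = 0; i < 1_000_000; i++) demo.consume();
    switcher.join();
  }
}
{code}

Run repeatedly, this sketch should occasionally die with the same "Called 
next() when there is no top" message.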

(elserj: some wiki formatting for readability)


was (Author: ivan.bella):
The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.

(elserj: some wiki formatting for readability)

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (ACCUMULO-4502) Called next when there is no top

2016-10-24 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602717#comment-15602717
 ] 

Ivan Bella edited comment on ACCUMULO-4502 at 10/24/16 6:57 PM:


The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.

(elserj: some wiki formatting for readability)


was (Author: ivan.bella):
The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.

(elserj: some wiki formatting for readability)

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (ACCUMULO-4502) Called next when there is no top

2016-10-24 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602717#comment-15602717
 ] 

Ivan Bella edited comment on ACCUMULO-4502 at 10/24/16 6:57 PM:


The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.

(elserj: some wiki formatting for readability)


was (Author: ivan.bella):
The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.

(elserj: some wiki formatting for readability)

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (ACCUMULO-4502) Called next when there is no top

2016-10-24 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602717#comment-15602717
 ] 

Ivan Bella edited comment on ACCUMULO-4502 at 10/24/16 6:56 PM:


The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.

(elserj: some wiki formatting for readability)


was (Author: ivan.bella):
The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.

(elserj: some wiki formatting for readability)

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (ACCUMULO-4502) Called next when there is no top

2016-10-24 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602717#comment-15602717
 ] 

Ivan Bella edited comment on ACCUMULO-4502 at 10/24/16 5:57 PM:


The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.


was (Author: ivan.bella):
The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-4502) Called next when there is no top

2016-10-24 Thread Ivan Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602717#comment-15602717
 ] 

Ivan Bella commented on ACCUMULO-4502:
--

The condition in the HeapIterator is that {{topIdx == null}}, which implies 
that hasTop() should have returned false.

In PartialMutationSkippingIterator.consume, it is simply doing the following:

{code}
protected void consume() throws IOException {
  while (getSource().hasTop() && ((MemKey) getSource().getTopKey()).kvCount > kvCount)
    getSource().next();
}
{code}

So it is clearly calling hasTop() before calling next().  The rub is that 
there is a SourceSwitchingIterator in between the 
PartialMutationSkippingIterator and the HeapIterator at the bottom.  Most 
likely a minor compaction caused the SourceSwitchingIterator to switch via the 
switchNow method.  When that happens the following code in {{_switchNow()}} is 
invoked:

{code}
if (switchSource()) {
  if (key != null) {
    iter.seek(...)
  }
}
{code}

So if {{getSource().hasTop()}} is invoked in the 
PartialMutationSkippingIterator and switchSource() is then called, but 
{{getSource().next()}} is invoked by the PartialMutationSkippingIterator 
BEFORE the nested {{if (key != null) iter.seek(...)}} executes, this scenario 
would leave the topIdx null in the HeapIterator (most likely the RFile.Reader 
at this point) and subsequently cause this exception.

> Called next when there is no top
> 
>
> Key: ACCUMULO-4502
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
> Project: Accumulo
>  Issue Type: Bug
>  Components: core, tserver
>Affects Versions: 1.6.6
>Reporter: Ivan Bella
>
> This happens very rarely but we have seen the following exception (pulled 
> from a server running 1.6.4).  Looking at the code I believe this condition 
> can still happen in 1.8.0:
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Called next() when there is no top
> ...
> Caused by: java.lang.IllegalStateException: Called next() when there is no top
> HeapIterator.next(HeapIterator.java: 77)
> WrappingIterator.next(WrappingIterator.java: 96)
> MemKeyConversionIterator.next(InMemoryMap.java:162)
> SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
> SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
> PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
> SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (ACCUMULO-4502) Called next when there is no top

2016-10-24 Thread Ivan Bella (JIRA)
Ivan Bella created ACCUMULO-4502:


 Summary: Called next when there is no top
 Key: ACCUMULO-4502
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4502
 Project: Accumulo
  Issue Type: Bug
  Components: core, tserver
Affects Versions: 1.6.6
Reporter: Ivan Bella


This happens very rarely but we have seen the following exception (pulled from 
a server running 1.6.4).  Looking at the code I believe this condition can 
still happen in 1.8.0:

java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
Called next() when there is no top
...
Caused by: java.lang.IllegalStateException: Called next() when there is no top
HeapIterator.next(HeapIterator.java: 77)
WrappingIterator.next(WrappingIterator.java: 96)
MemKeyConversionIterator.next(InMemoryMap.java:162)
SourceSwitchingIterator.readNext(SourceSwitchingIterator.java: 139)
SourceSwitchingIterator.next(SourceSwitchingIterator.java: 123)
PartialMutationSkippingIterator.consume(InMemoryMap.java:108)
SkippingIterator.seek(SkippingIterator.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (ACCUMULO-4455) Improve test coverage for thread-safety in iterator deep copies

2016-10-11 Thread Ivan Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bella reassigned ACCUMULO-4455:


Assignee: Ivan Bella

> Improve test coverage for thread-safety in iterator deep copies
> ---
>
> Key: ACCUMULO-4455
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4455
> Project: Accumulo
>  Issue Type: Test
>Affects Versions: 1.6.6
>Reporter: Christopher Tubbs
>Assignee: Ivan Bella
> Fix For: 1.7.3, 1.8.1, 2.0.0
>
>
> As a follow-up from ACCUMULO-4391, it was noted on that issue that 
> additional tests could be created.
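
A hypothetical sketch of one such test, using toy classes rather than the real 
Accumulo iterators: several reader threads run the hasTop()/next() pattern 
over shared copies while a switcher thread opens the un-seeked window, and any 
IllegalStateException fails the run.

{code}
// Hypothetical stress test (toy classes, not the real Accumulo iterators):
// fail the run if the un-seeked window ever surfaces as the reported
// IllegalStateException.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class DeepCopyStressTest {

  static class ToyCopy {
    private volatile boolean top = true;

    boolean hasTop() { return top; }

    void next() {
      if (!top) throw new IllegalStateException("Called next() when there is no top");
    }

    // Models an unsynchronized switch: the copy is briefly un-seeked.
    void switchNow() {
      top = false;  // new source installed, not yet seeked
      top = true;   // re-seek completes
    }
  }

  public static void main(String[] args) throws InterruptedException {
    List<ToyCopy> copies = new ArrayList<>();
    for (int i = 0; i < 4; i++) copies.add(new ToyCopy());
    AtomicReference<Throwable> failure = new AtomicReference<>();

    // One thread forces switches across all copies...
    Thread switcher = new Thread(() -> {
      for (int i = 0; i < 100_000; i++)
        for (ToyCopy c : copies) c.switchNow();
    });

    // ...while one reader per copy runs the hasTop()/next() pattern.
    List<Thread> readers = new ArrayList<>();
    for (ToyCopy c : copies) {
      readers.add(new Thread(() -> {
        try {
          for (int i = 0; i < 100_000; i++)
            if (c.hasTop()) c.next();
        } catch (Throwable t) {
          failure.compareAndSet(null, t);
        }
      }));
    }

    switcher.start();
    for (Thread t : readers) t.start();
    switcher.join();
    for (Thread t : readers) t.join();

    if (failure.get() != null)
      throw new AssertionError("deep-copy race detected", failure.get());
  }
}
{code}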



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

