[jira] [Commented] (ACCUMULO-4615) ThreadPool timeout when checking tserver stats is confusing

2018-03-16 Thread Jeff Schmidt (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402183#comment-16402183
 ] 

Jeff Schmidt commented on ACCUMULO-4615:


PR here: https://github.com/apache/accumulo/pull/402

> ThreadPool timeout when checking tserver stats is confusing
> ---
>
> Key: ACCUMULO-4615
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4615
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Affects Versions: 1.8.1
> Reporter: Michael Wall
> Assignee: Jeff Schmidt
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.9.0, 2.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> If it takes longer than the configured time to gather information from all 
> the tablet servers, the thread pool stops and processing continues with 
> whatever has been collected.  The code is at 
> https://github.com/apache/accumulo/blob/1.8/server/master/src/main/java/org/apache/accumulo/master/Master.java#L1120
> and the default timeout is 6s.  This does not appear to be an issue prior to 1.8.
> Best case, this was really confusing.  The monitor page would show 30 
> tservers, then 5 tservers.  I didn't really see any other negative effects; no 
> migrations and no balancing appeared to be affected.  Worst case, though, I 
> missed something and the master is making decisions based on incomplete 
> information.
> [~dlmar...@comcast.net] please add more info if needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ACCUMULO-4615) ThreadPool timeout when checking tserver stats is confusing

2018-02-22 Thread Jeff Schmidt (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373267#comment-16373267
 ] 

Jeff Schmidt commented on ACCUMULO-4615:


Sorry for the delay on this. I have an initial fix here: 
[https://github.com/jschmidt10/accumulo/commit/ce3ffae0e85f0b314af2401fd0dd054b51a51277]

I will be testing it on a deployed system shortly but any early feedback is 
appreciated too.

The general idea is to:

1) Use a timeout per status-gathering task (instead of a timeout for the entire 
pool)
2) Change the status-gathering results to a thread-safe data structure 
(ConcurrentSkipListMap)
3) Add a separate property for the status timeout (per tserver)
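
The three points above can be sketched roughly as follows. This is a minimal illustration and not the code in the linked commit: the class name, the gather() helper, the String status values, and the timeout parameter are all hypothetical stand-ins for the real Master types and the new property.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class PerTaskTimeoutSketch {

  // Hypothetical helper: 'fetchers' maps a server name to a call that returns its status.
  static SortedMap<String,String> gather(Map<String,Callable<String>> fetchers, long perTaskTimeoutMillis) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, fetchers.size()));
    // Thread-safe sorted map, so readers never see a plain TreeMap being modified.
    SortedMap<String,String> results = new ConcurrentSkipListMap<>();
    Map<String,Future<String>> futures = new LinkedHashMap<>();
    try {
      for (Map.Entry<String,Callable<String>> e : fetchers.entrySet()) {
        futures.put(e.getKey(), pool.submit(e.getValue()));
      }
      for (Map.Entry<String,Future<String>> e : futures.entrySet()) {
        try {
          // The timeout applies to this one task, not to the pool as a whole.
          results.put(e.getKey(), e.getValue().get(perTaskTimeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException ex) {
          // Cancel the slow or failed task so it cannot report after we return.
          e.getValue().cancel(true);
        } catch (InterruptedException ex) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    } finally {
      // Interrupt anything still running; only the collecting thread wrote to 'results'.
      pool.shutdownNow();
    }
    return results;
  }

  public static void main(String[] args) {
    Map<String,Callable<String>> fetchers = new LinkedHashMap<>();
    fetchers.put("fast-tserver", () -> "ok");
    fetchers.put("slow-tserver", () -> { Thread.sleep(5_000); return "late"; });
    System.out.println(gather(fetchers, 200).keySet());
  }
}
```

In this sketch the waits happen one after another, so the clock for each task starts when the collector begins waiting on it rather than when the task was submitted. Only the collecting thread writes to the result map, which means a late task has no reference through which it could modify the map after gather() returns.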

> ThreadPool timeout when checking tserver stats is confusing
> ---
>
> Key: ACCUMULO-4615
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4615
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Affects Versions: 1.8.1
> Reporter: Michael Wall
> Assignee: Jeff Schmidt
> Priority: Minor
> Fix For: 1.9.0, 2.0.0





[jira] [Commented] (ACCUMULO-4615) ThreadPool timeout when checking tserver stats is confusing

2017-12-29 Thread Jeff Schmidt (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306395#comment-16306395
 ] 

Jeff Schmidt commented on ACCUMULO-4615:


Hi! I just wanted to ping on this ticket, as our project ran into an issue where 
the mentioned code caused a ConcurrentModificationException while the Master 
was attempting to balance, which led to several "Error balancing tablets, will 
wait for 1 (seconds) and then retry" messages.

In Master#StatusThread.updateStatus(), we do:
{code}
tserverStatus = 
Collections.synchronizedSortedMap(gatherTableInformation(currentServers));
{code}

That resulting map gets used by the balancer. However, even though it's wrapped 
in a synchronized map and an unmodifiable map, the inner map (the one returned 
by gatherTableInformation()) can still be modified concurrently because of 
the way we shut down the status thread pool.
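
To illustrate why the wrappers do not help, here is a small standalone sketch. It is not Accumulo code: the class and method names are made up, and the late writer is simulated on a single thread so that the failure is deterministic rather than timing-dependent.

```java
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.SortedMap;
import java.util.TreeMap;

final class WrappedMapSketch {

  // Returns true if iterating the wrapped view fails once the inner map is changed directly.
  static boolean wrappedViewFailsOnInnerChange() {
    SortedMap<String,Integer> inner = new TreeMap<>();
    inner.put("tserver1", 1);
    inner.put("tserver2", 2);

    // Same layering as described above: synchronized wrapper, then unmodifiable wrapper.
    SortedMap<String,Integer> view =
        Collections.unmodifiableSortedMap(Collections.synchronizedSortedMap(inner));

    try {
      for (String server : view.keySet()) {
        // Stand-in for a late status task that still holds a reference to the inner map.
        inner.put("tserver3", 3);
      }
      return false;
    } catch (ConcurrentModificationException e) {
      // The wrappers delegate iteration to the TreeMap's fail-fast iterator.
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("CME observed: " + wrappedViewFailsOnInnerChange());
  }
}
```

The wrappers only guard access that goes through them. Any code that writes to the inner TreeMap directly bypasses both, and iterators obtained from the synchronized view are not locked for the duration of the iteration.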

In StatusThread.gatherTableInformation(), we do:
{code}
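tp.shutdown(); // stops accepting new tasks; already-submitted tasks keep running
try {
  // arguments elided here; awaitTermination takes (long timeout, TimeUnit unit)
  tp.awaitTermination(...);
} catch (InterruptedException e) {
  log.debug(...);
}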
{code}

The shutdown request allows submitted tasks to continue but prevents new tasks 
from being submitted. If awaitTermination() times out, slow-reporting tservers 
can therefore still modify this map after the method returns, leading to the 
ConcurrentModificationException.
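
The shutdown behaviour can be reproduced in isolation with the sketch below. It is an illustration under assumptions, not the Master code: the class name, the map contents, and the 100 ms wait are hypothetical, and latches stand in for a slow tserver so the ordering is deterministic.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class LateWriterSketch {

  // Returns {map size when the wait gave up, map size after the slow task finished}.
  static int[] run() {
    Map<String,String> results = new TreeMap<>();
    CountDownLatch release = new CountDownLatch(1); // holds the "slow tserver" back
    CountDownLatch done = new CountDownLatch(1);    // signals that the late write happened
    ExecutorService pool = Executors.newSingleThreadExecutor();

    pool.submit(() -> {
      release.await();
      results.put("slow-tserver", "status");
      done.countDown();
      return null;
    });

    try {
      pool.shutdown(); // no new tasks, but the submitted task keeps running
      boolean finished = pool.awaitTermination(100, TimeUnit.MILLISECONDS);
      if (finished) {
        throw new IllegalStateException("task should still be running");
      }
      int sizeAtReturn = results.size(); // what a caller would see at this point

      release.countDown(); // the slow task now completes and writes to the map
      done.await();
      return new int[] {sizeAtReturn, results.size()};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    int[] sizes = run();
    System.out.println("at return: " + sizes[0] + ", later: " + sizes[1]);
  }
}
```

The map grows after the wait has already given up, which is the window in which a reader iterating the map could hit the exception. Calling shutdownNow() after a timed-out wait would interrupt the remaining tasks, although a task that ignores interruption could still write late.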

I'd be happy to work on a fix if no one has started tackling this yet.

> ThreadPool timeout when checking tserver stats is confusing
> ---
>
> Key: ACCUMULO-4615
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4615
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Affects Versions: 1.8.1
> Reporter: Michael Wall
> Priority: Minor
> Fix For: 1.8.2, 2.0.0





[jira] [Commented] (ACCUMULO-4615) ThreadPool timeout when checking tserver stats is confusing

2017-03-28 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945855#comment-15945855
 ] 

Josh Elser commented on ACCUMULO-4615:
--

bq. the monitoring was freaking out, showing different values for # tservers, # 
tablets, # offline tables.

Yuck.

bq. If it takes longer than the configured time to gather information from all 
the tablet servers, the thread pool stops and processing continues with 
whatever has been collected

Maybe stats should be collected in the background on some interval instead of 
on demand? If we don't get a response within some threshold, we could fall 
back to the previous value?
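
A rough sketch of the fall-back idea, purely as an illustration: the class name, the refresh() method, and the generic value type are hypothetical, and nothing here reflects existing Accumulo or monitor code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

final class CachedStatSketch<T> {
  private final AtomicReference<T> last;
  private final ExecutorService worker = Executors.newCachedThreadPool();

  CachedStatSketch(T initial) {
    this.last = new AtomicReference<>(initial);
  }

  // One refresh attempt: keeps the previous value if the fetch times out or fails.
  T refresh(Callable<T> fetch, long timeoutMillis) {
    Future<T> future = worker.submit(fetch);
    try {
      last.set(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
    } catch (TimeoutException | ExecutionException e) {
      future.cancel(true); // give up on this attempt; 'last' is left untouched
    } catch (InterruptedException e) {
      future.cancel(true);
      Thread.currentThread().interrupt();
    }
    return last.get();
  }

  // What an on-demand reader (for example a monitor page) would call.
  T current() {
    return last.get();
  }

  void close() {
    worker.shutdownNow();
  }

  public static void main(String[] args) {
    CachedStatSketch<Integer> tserverCount = new CachedStatSketch<>(0);
    tserverCount.refresh(() -> 30, 1_000);
    tserverCount.refresh(() -> { Thread.sleep(5_000); return 5; }, 100);
    System.out.println("tservers shown: " + tserverCount.current());
    tserverCount.close();
  }
}
```

To collect in the background, refresh() could be driven by a ScheduledExecutorService via scheduleWithFixedDelay, with readers calling only current(). A real implementation would probably also want to record how stale the cached value is.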

FYI [~lstav] as this might be of interest to you in the monitor-reworking on 
master.



> ThreadPool timeout when checking tserver stats is confusing
> ---
>
> Key: ACCUMULO-4615
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4615
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Affects Versions: 1.8.1
> Reporter: Michael Wall
> Priority: Minor





[jira] [Commented] (ACCUMULO-4615) ThreadPool timeout when checking tserver stats is confusing

2017-03-28 Thread Dave Marion (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945815#comment-15945815
 ] 

Dave Marion commented on ACCUMULO-4615:
---

[~mjwall] You got most of it. Basically, the monitoring was freaking out, 
showing different values for # tservers, # tablets, # offline tables. The 
default timeout is 2 * client timeout (default 3s). I'm not sure whether 
increasing the client timeout will cause other issues, or whether we should be 
using a different property.

> ThreadPool timeout when checking tserver stats is confusing
> ---
>
> Key: ACCUMULO-4615
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4615
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Affects Versions: 1.8.1
> Reporter: Michael Wall
> Priority: Minor


