[jira] [Commented] (ACCUMULO-4615) ThreadPool timeout when checking tserver stats is confusing
[ https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402183#comment-16402183 ]

Jeff Schmidt commented on ACCUMULO-4615:

PR here: https://github.com/apache/accumulo/pull/402

> ThreadPool timeout when checking tserver stats is confusing
> ---
>
> Key: ACCUMULO-4615
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4615
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Affects Versions: 1.8.1
> Reporter: Michael Wall
> Assignee: Jeff Schmidt
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.9.0, 2.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> If it takes longer than the configured time to gather information from all
> the tablet servers, the thread pool stops and processing continues with
> whatever has been collected. Code is
> https://github.com/apache/accumulo/blob/1.8/server/master/src/main/java/org/apache/accumulo/master/Master.java#L1120,
> default timeout is 6s. Does not appear to be an issue prior to 1.8.
> Best case, this was really confusing. The monitor page would have 30
> tservers, then 5 tservers. Didn't really see any other negative effects, no
> migrations and no balancing appeared to be affected. Worst case though, I
> missed something and the master is making decisions based on incomplete
> information.
> [~dlmar...@comcast.net] please add more info if needed.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ACCUMULO-4615) ThreadPool timeout when checking tserver stats is confusing
[ https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373267#comment-16373267 ]

Jeff Schmidt commented on ACCUMULO-4615:

Sorry for the delay on this. I have an initial fix here:
[https://github.com/jschmidt10/accumulo/commit/ce3ffae0e85f0b314af2401fd0dd054b51a51277]

I will be testing it on a deployed system shortly, but any early feedback is appreciated too.

The general idea is to:
1) Use a timeout per status-gathering task (instead of a timeout for the entire pool)
2) Change the status-gathering results to a thread-safe data structure (ConcurrentSkipListMap)
3) Add a separate property for the status timeout (per tserver)
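The three points above can be illustrated with a minimal sketch. This is not the code from the linked commit; the class name PerTaskTimeoutSketch, the method gatherStatus, the String-valued status and the "tserver-a"/"tserver-b" names are all hypothetical and chosen only for illustration. It assumes each tserver query is a Callable that returns a non-null value, bounds the wait on each task with Future.get(timeout), and stores results in a ConcurrentSkipListMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PerTaskTimeoutSketch {

  /**
   * Hypothetical helper: runs one fetcher per server and waits at most
   * timeoutMillis on each task's Future. Fetchers must not return null,
   * because ConcurrentSkipListMap rejects null values.
   */
  public static SortedMap<String,String> gatherStatus(
      Map<String,Callable<String>> fetchers, long timeoutMillis) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, fetchers.size()));
    // Thread-safe sorted map: its iterators never throw
    // ConcurrentModificationException, even if another thread reads it.
    SortedMap<String,String> results = new ConcurrentSkipListMap<>();
    Map<String,Future<String>> pending = new LinkedHashMap<>();
    for (Map.Entry<String,Callable<String>> e : fetchers.entrySet()) {
      pending.put(e.getKey(), pool.submit(e.getValue()));
    }
    for (Map.Entry<String,Future<String>> e : pending.entrySet()) {
      try {
        // The timeout applies to this task, not to the pool as a whole.
        results.put(e.getKey(), e.getValue().get(timeoutMillis, TimeUnit.MILLISECONDS));
      } catch (TimeoutException | ExecutionException ex) {
        // Slow or failed server: cancel it so it cannot report late.
        e.getValue().cancel(true);
      } catch (InterruptedException ex) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    pool.shutdownNow(); // interrupt anything still running
    return results;
  }

  public static void main(String[] args) {
    Map<String,Callable<String>> fetchers = new TreeMap<>();
    fetchers.put("tserver-a", () -> "online");
    fetchers.put("tserver-b", () -> {
      Thread.sleep(2000); // simulated slow tserver
      return "online";
    });
    // Only the fast server is expected in the result.
    System.out.println(gatherStatus(fetchers, 200));
  }
}
```

Only the calling thread writes to the result map in this sketch, so a late task cannot change it after the method returns. The waits are sequential, so in the worst case the total time can exceed a single timeout; a fuller version might track one deadline per task from the moment it was submitted.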
[jira] [Commented] (ACCUMULO-4615) ThreadPool timeout when checking tserver stats is confusing
[ https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306395#comment-16306395 ]

Jeff Schmidt commented on ACCUMULO-4615:

Hi! I just wanted to ping on this ticket, as our project ran into an issue where the mentioned code caused a ConcurrentModificationException while the Master was attempting to balance, producing several "Error balancing tablets, will wait for 1 (seconds) and then retry" messages.

In Master#StatusThread.updateStatus(), we do:

{code}
tserverStatus = Collections.synchronizedSortedMap(gatherTableInformation(currentServers));
{code}

The resulting map gets used by the balancer. However, even though it's wrapped in a synchronized map and an unmodifiable map, the inner map (the one returned by gatherTableInformation()) can still be modified concurrently because of the way we shut down the status thread pool.

In StatusThread.gatherTableInformation(), we do:

{code}
tp.shutdown();
try {
  tp.awaitTermination();
} catch (InterruptedException e) {
  log.debug(...);
}
{code}

The shutdown request allows already-submitted tasks to continue but prevents new tasks from being submitted. So slow-reporting tservers can still modify this map after the method returns, leading to the ConcurrentModificationException.

I'd be happy to work on a fix if no one has started tackling this yet.
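The behaviour described in this comment can be reproduced in isolation. The following is a minimal sketch, not Accumulo code: the class name ShutdownRaceSketch, the method gather and the "tserver-fast"/"tserver-slow" keys are hypothetical. It uses latches to stand in for a slow tserver and shows that a task submitted before shutdown() can still write to the shared map after awaitTermination() has timed out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownRaceSketch {

  /** Returns {entries seen when gathering gives up, entries seen after the slow task ends}. */
  public static int[] gather() {
    ExecutorService tp = Executors.newFixedThreadPool(2);
    // A concurrent map keeps this demo itself safe; the point is only to
    // show that writes can arrive after the gather step has moved on.
    Map<String,String> results = new ConcurrentHashMap<>();
    CountDownLatch fastDone = new CountDownLatch(1);
    CountDownLatch releaseSlow = new CountDownLatch(1);

    tp.submit(() -> {
      results.put("tserver-fast", "ok");
      fastDone.countDown();
    });
    tp.submit(() -> {
      try {
        releaseSlow.await(); // stands in for a tserver that is slow to report
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      results.put("tserver-slow", "ok");
    });

    try {
      fastDone.await();
      tp.shutdown(); // rejects new tasks, but the slow task keeps running
      tp.awaitTermination(100, TimeUnit.MILLISECONDS); // times out, returns false
      int seenAtReturn = results.size();

      releaseSlow.countDown(); // the slow tserver finally answers
      tp.awaitTermination(5, TimeUnit.SECONDS);
      int seenLater = results.size();
      return new int[] {seenAtReturn, seenLater};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    int[] seen = gather();
    System.out.println("at return: " + seen[0] + ", later: " + seen[1]);
  }
}
```

The map holds one entry when the gather step gives up and two once the slow task completes. With a plain TreeMap in place of the concurrent map, a reader iterating during that late write is the kind of situation that can raise a ConcurrentModificationException.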
[jira] [Commented] (ACCUMULO-4615) ThreadPool timeout when checking tserver stats is confusing
[ https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945855#comment-15945855 ]

Josh Elser commented on ACCUMULO-4615:

bq. the monitoring was freaking out, showing different values for # tservers, # tablets, # offline tables.

Yuck.

bq. If it takes longer than the configured time to gather information from all the tablet servers, the thread pool stops and processing continues with whatever has been collected

Maybe stats should be collected in the background on some interval instead of on demand? In the case where we don't get a response within some threshold, we could fall back to the previous value?

FYI [~lstav] as this might be of interest to you in the monitor reworking on master.
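One way to picture the suggestion in this comment is a small cache that a background task refreshes on an interval and that keeps the last known value for any server that did not answer in time. This is only a sketch under assumptions: the class CachedStatsSketch, its methods, and the use of Optional.empty() to mean "no response within the threshold" are hypothetical and do not come from Accumulo.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class CachedStatsSketch {

  // Last status successfully reported by each server.
  private final ConcurrentMap<String,String> lastKnown = new ConcurrentHashMap<>();

  /** Applies one polling round; an empty Optional means the server did not answer in time. */
  public void refresh(Map<String,Optional<String>> latest) {
    latest.forEach((server, status) ->
        // Only overwrite when a fresh value arrived; otherwise keep the old one.
        status.ifPresent(s -> lastKnown.put(server, s)));
  }

  /** Returns a sorted copy, so readers never observe a map that is being updated. */
  public Map<String,String> snapshot() {
    return new TreeMap<>(lastKnown);
  }

  /** Hypothetical wiring: poll in the background instead of on demand. */
  public ScheduledExecutorService start(
      Supplier<Map<String,Optional<String>>> poller, long periodMillis) {
    ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
    ses.scheduleWithFixedDelay(() -> refresh(poller.get()), 0, periodMillis,
        TimeUnit.MILLISECONDS);
    return ses;
  }

  public static void main(String[] args) {
    CachedStatsSketch cache = new CachedStatsSketch();

    Map<String,Optional<String>> round1 = new HashMap<>();
    round1.put("ts1", Optional.of("10 tablets"));
    round1.put("ts2", Optional.of("12 tablets"));
    cache.refresh(round1);

    Map<String,Optional<String>> round2 = new HashMap<>();
    round2.put("ts1", Optional.of("11 tablets"));
    round2.put("ts2", Optional.empty()); // ts2 was too slow this round
    cache.refresh(round2);

    // ts2 keeps its previous value instead of disappearing.
    System.out.println(cache.snapshot());
  }
}
```

A server that never answers again would keep its stale entry forever in this sketch, so a real version would likely need an expiry rule to tell a slow tserver from a dead one.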
[jira] [Commented] (ACCUMULO-4615) ThreadPool timeout when checking tserver stats is confusing
[ https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945815#comment-15945815 ]

Dave Marion commented on ACCUMULO-4615:

[~mjwall] You got most of it. Basically, the monitoring was freaking out, showing different values for # tservers, # tablets, # offline tables. The default timeout is 2 * client timeout (default 3s). I'm not sure whether increasing the client timeout would cause other issues, or whether we should be using a different property.