[jira] [Commented] (HBASE-22289) WAL-based log splitting resubmit threshold results in a task being stuck forever

2019-04-22 Thread Sergey Shelukhin (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823507#comment-16823507 ]

Sergey Shelukhin commented on HBASE-22289:
--

Actually never mind, the ZK error was a red herring (although it could still 
happen if some other ZK update fails; in this case it was actually a 
take-ownership update that failed, and it succeeded on retry).
I see that the RS logged handler.WALSplitterHandler: task execution preempted 
for this task.
That led me to this helpful TODO:
{noformat}
// TODO have to correctly figure out when log splitting has been
// interrupted or has encountered a transient error and when it has
// encountered a bad non-retry-able persistent error.
if (!WALSplitter.splitLogFile(walDir, fs.getFileStatus(new Path(walDir, name)),
    fs, conf, p, sequenceIdChecker, splitLogWorkerCoordination, factory)) {
  return Status.PREEMPTED;
}
{noformat}
When the task is preempted, the RS in fact never updates the task for the 
master. I'm not sure what the logic behind that is; what is the master supposed 
to gain from not having the updated information?
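
For illustration, here is a minimal sketch of the worker-side pattern that log 
line suggests (simplified, hypothetical names; not the actual WALSplitterHandler 
code): terminal outcomes get reported back through the coordination, but a 
PREEMPTED result is only logged, so the master-side task is never updated from 
this path.
{noformat}
// Hypothetical, simplified sketch of the worker-side handling implied by the
// "task execution preempted" log line above; names are illustrative only.
enum TaskStatus { DONE, ERR, RESIGNED, PREEMPTED }

class WalSplitTaskHandlerSketch {
  interface Coordination {
    // Reports the final outcome back to the master (e.g. via a ZK write).
    void endTask(String taskNode, TaskStatus finalStatus);
  }

  private final Coordination coordination;

  WalSplitTaskHandlerSketch(Coordination coordination) {
    this.coordination = coordination;
  }

  void handle(String taskNode, TaskStatus status) {
    switch (status) {
      case DONE:
      case ERR:
      case RESIGNED:
        // Terminal outcomes are pushed back so the master can update its view.
        coordination.endTask(taskNode, status);
        break;
      case PREEMPTED:
        // Only logged; no update is sent, which is what is observed here:
        // the master keeps thinking the task is in_progress on this worker.
        System.out.println("task execution preempted " + taskNode);
        break;
    }
  }
}
{noformat}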

> WAL-based log splitting resubmit threshold results in a task being stuck 
> forever
> 
>
> Key: HBASE-22289
> URL: https://issues.apache.org/jira/browse/HBASE-22289
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> Not sure if this is handled better in procedure-based WAL splitting; in any 
> case it affects versions before that.
> The problem is not in ZK as such but in internal state tracking in the 
> master, it seems.
> Master:
> {noformat}
> 2019-04-21 01:49:49,584 INFO  
> [master/:17000.splitLogManager..Chore.1] 
> coordination.SplitLogManagerCoordination: Resubmitting task 
> .1555831286638
> {noformat}
> Worker RS, split fails:
> {noformat}
> 
> 2019-04-21 02:05:31,774 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/:17020-1] wal.WALSplitter: 
> Processed 24 edits across 2 regions; edits skipped=457; log 
> file=.1555831286638, length=2156363702, corrupted=false, progress 
> failed=true
> {noformat}
> Master (not sure about the delay of the acquired message; at any rate it 
> seems to detect the failure from this server fine):
> {noformat}
> 2019-04-21 02:11:14,928 INFO  [main-EventThread] 
> coordination.SplitLogManagerCoordination: Task .1555831286638 acquired 
> by ,17020,139815097
> 2019-04-21 02:19:41,264 INFO  
> [master/:17000.splitLogManager..Chore.1] 
> coordination.SplitLogManagerCoordination: Skipping resubmissions of task 
> .1555831286638 because threshold 3 reached
> {noformat}
> After that this task is stuck in limbo forever with the old worker and is 
> never resubmitted.
> The RS never logs anything else for this task.
> Killing the worker RS unblocked the task and some other server did the 
> split very quickly, so it seems like the master doesn't clear the worker name 
> in its internal state when hitting the threshold... The master was never 
> restarted, so restarting the master might have also cleared it.
> This is extracted from SplitLogManager log messages; note the times.
> {noformat}
> 2019-04-21 02:2   1555831286638=last_update = 1555837874928 last_version = 11 
> cur_worker_name = ,17020,139815097 status = in_progress 
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20, 
> 
> 2019-04-22 11:1   1555831286638=last_update = 1555837874928 last_version = 11 
> cur_worker_name = ,17020,139815097 status = in_progress 
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22289) WAL-based log splitting resubmit threshold results in a task being stuck forever

2019-04-22 Thread Sergey Shelukhin (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823503#comment-16823503 ]

Sergey Shelukhin commented on HBASE-22289:
--

So the root cause is that, for this task in particular, the RS's ZK operation 
to report the error failed permanently against ZK; this RS was having a bad 
time but then recovered.
Basically, the ZK task update from the RS is a critical message, and if it is 
lost but the RS doesn't die, the master will forever think the task is in 
progress. And if the resubmit threshold happens to be hit, it will never 
resubmit the "in progress" task, so it will be stuck until either the worker RS 
or the master dies.

[~tianjingyun] Not sure if the procedure-based implementation in HBASE-21588 is 
affected by a similar issue if the message from the RS to the master is lost. 
Worth checking, because the OpenRegion proc has special handling for this, 
where the RS reports to the master and kills itself if the report fails 
(otherwise, if the message is lost, that too would be stuck forever).

That might be what needs to be done for the ZK implementation too, pre-2.2. The 
proper fix would be to get rid of imperative updates and report/handle current 
state, but that's too much of a change for older branches given that the ZK 
impl is abandoned.
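
A rough sketch of that report-or-abort pattern, modeled on the OpenRegion-proc 
behaviour described above (hypothetical names; the retry count, backoff, and 
abort hook are assumptions, not HBase APIs):
{noformat}
// Hypothetical sketch: retry the status update a bounded number of times and
// abort the RS if it cannot be delivered, so the master reassigns the work
// instead of waiting forever on a silently lost message.
class ReportOrAbortSketch {
  interface StatusReporter {
    void reportTaskStatus(String taskNode, String status) throws Exception; // e.g. a ZK write
  }

  interface Server {
    void abort(String why, Throwable cause); // terminate this RS
  }

  static void reportOrAbort(StatusReporter reporter, Server server,
      String taskNode, String status, int maxAttempts) {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        reporter.reportTaskStatus(taskNode, status);
        return; // delivered; the master's view is now consistent
      } catch (Exception e) {
        last = e;
        try {
          Thread.sleep(1000L * attempt); // simple linear backoff between retries
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    // The update is critical: if it cannot be delivered, dying is safer than
    // leaving the master convinced the task is still in progress here.
    server.abort("Failed to report status for " + taskNode, last);
  }
}
{noformat}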

[jira] [Commented] (HBASE-22289) WAL-based log splitting resubmit threshold results in a task being stuck forever

2019-04-22 Thread Sergey Shelukhin (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823487#comment-16823487 ]

Sergey Shelukhin commented on HBASE-22289:
--

I see that for other tasks that reached the threshold, an error is eventually 
detected:
{noformat}
2019-04-21 02:36:52,951 INFO  [main-EventThread] 
coordination.SplitLogManagerCoordination: Skipping resubmissions of task 
.1555832335124 because threshold 3 reached
2019-04-21 02:36:52,951 WARN  [main-EventThread] 
coordination.SplitLogManagerCoordination: Error splitting .1555832335124
{noformat}

However, no error is logged for the above task. It stays in the tasks map, but 
after the threshold flag is set, it's never picked up by the timer again.
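
Roughly, a sketch of the resubmission check implied by the "Skipping 
resubmissions ... because threshold 3 reached" log line (simplified and 
hypothetical, not the actual ZKSplitLogManagerCoordination code): once the 
unforced-resubmit count hits the threshold, the periodic check only logs and 
gives up, and nothing else ever moves the task out of in_progress.
{noformat}
// Simplified, hypothetical sketch of the periodic resubmission check.
class ResubmitCheckSketch {
  static final class Task {
    volatile String state = "in_progress";
    volatile int unforcedResubmits;
  }

  private final int resubmitThreshold = 3;

  // Called from the timeout chore for each timed-out task.
  boolean maybeResubmit(String taskNode, Task task) {
    if (task.unforcedResubmits >= resubmitThreshold) {
      // Only logs and gives up; the task stays in_progress with the old worker
      // and no other code path revisits it, which is the stuck state seen here.
      System.out.println("Skipping resubmissions of task " + taskNode
          + " because threshold " + resubmitThreshold + " reached");
      return false;
    }
    task.unforcedResubmits++;
    // ... reassign the task to another worker ...
    return true;
  }
}
{noformat}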
