[ 
https://issues.apache.org/jira/browse/HBASE-26435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reid Chan resolved HBASE-26435.
-------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed

> [branch-1] The log rolling request may be canceled immediately in LogRoller 
> due to a race 
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-26435
>                 URL: https://issues.apache.org/jira/browse/HBASE-26435
>             Project: HBase
>          Issue Type: Sub-task
>          Components: wal
>    Affects Versions: 1.6.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>             Fix For: 1.7.2
>
>
> Saw this issue in our internal 1.6 branch.
> All the writes to this RS were failing because the underlying hdfs file 
> was corrupt. This healed after 1 hour (the value of the 
> hbase.regionserver.logroll.period conf key). 
> The WAL was rolled, but the new WAL file was not writable, and the 
> following error was logged: 
> {noformat}
> 2021-11-03 19:20:19,503 WARN  [.168:60020.logRoller] hdfs.DFSClient - Error 
> while syncing
> java.io.IOException: Could not get block locations. Source file 
> "/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389"
>  - Aborting...
>         at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
> 2021-11-03 19:20:19,507 WARN  [.168:60020.logRoller] wal.FSHLog - pre-sync 
> failed but an optimization so keep going
> java.io.IOException: Could not get block locations. Source file 
> "/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389"
>  - Aborting...
>         at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
> {noformat}
> Since the new WAL file was not writable, appends to that file started failing 
> immediately after it was rolled.
> {noformat}
> 2021-11-03 19:20:19,677 INFO  [.168:60020.logRoller] wal.FSHLog - Rolled WAL 
> /hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635965392022
>  with entries=253234, filesize=425.67 MB; new WAL 
> /hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389
> 2021-11-03 19:20:19,690 WARN  [020.append-pool17-t1] wal.FSHLog - Append 
> sequenceId=1962661783, requesting roll of WAL
> java.io.IOException: Could not get block locations. Source file 
> "/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389"
>  - Aborting...
>         at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
> 2021-11-03 19:20:19,690 INFO  [.168:60020.logRoller] wal.FSHLog - Archiving 
> hdfs://prod-EMPTY-hbase2a/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635960792837
>  to 
> hdfs://prod-EMPTY-hbase2a/hbase/oldWALs/hbase2a-dnds1-232-ukb.ops.sfdc.net%2C60020%2C1635567166484.1635960792837
> {noformat}
> We always reset the rollLog flag within the LogRoller thread after the rollWal 
> call is complete.
> The FSHLog#rollWriter method does many things, like replacing the writer and 
> archiving old logs. If the append thread fails to write to the new file and 
> requests a roll while the logRoller thread is still cleaning old logs, we 
> will miss that request, because LogRoller resets the rollLog flag to false 
> only after the previous rollWriter call finishes.
> Relevant code: 
> https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java#L183-L203
> We need to reset the rollLog flag before we start rolling the WAL. 
> This is fixed in branch-2 and master via HBASE-22684, but it was never fixed 
> in branch-1.
> Also, branch-2 has a multi-WAL implementation, so the patch cannot be applied 
> cleanly to branch-1.
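The ordering bug described above can be sketched in isolation. This is a hypothetical simplification, not the actual LogRoller/FSHLog code: `rollLog` stands in for LogRoller's roll-request flag, and the `Runnable` stands in for a rollWriter call during which an append thread requests another roll.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the LogRoller race; names are simplified
// stand-ins, not the real HBase classes.
public class RollFlagSketch {
    public static final AtomicBoolean rollLog = new AtomicBoolean(false);

    // Buggy order (branch-1): the flag is cleared AFTER rollWriter()
    // returns, so a roll request made while rollWriter() is running
    // (e.g. by a failing append) is silently wiped out.
    public static void buggyRollLoop(Runnable rollWriter) {
        rollWriter.run();   // replaces writer, archives old logs, etc.
        rollLog.set(false); // loses any request raised in the meantime
    }

    // Fixed order: clear the flag BEFORE rolling, so a request that
    // races with rollWriter() stays set and triggers another roll on
    // the next iteration.
    public static void fixedRollLoop(Runnable rollWriter) {
        rollLog.set(false);
        rollWriter.run();
    }

    public static void main(String[] args) {
        // An append thread requests a roll while rollWriter() runs.
        Runnable concurrentRequest = () -> rollLog.set(true);

        rollLog.set(true);
        buggyRollLoop(concurrentRequest);
        System.out.println("buggy order, request survives: " + rollLog.get()); // false

        rollLog.set(true);
        fixedRollLoop(concurrentRequest);
        System.out.println("fixed order, request survives: " + rollLog.get()); // true
    }
}
```

With the buggy ordering the concurrent request is erased and the unwritable WAL stays in place until the periodic roll fires an hour later; with the flag reset up front, the request survives and the next loop iteration rolls again.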



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
