[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?

2022-01-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477853#comment-17477853
 ] 

Viraj Jasani edited comment on HBASE-20503 at 1/18/22, 1:28 PM:


While upgrading HBase 1.6.0 to 2.4.6/7, we are facing a similar issue:
{code:java}
2022-01-18 12:45:59,051 WARN  [Close-WAL-Writer-4] wal.AbstractProtobufLogWriter - Failed to write trailer, non-fatal, continuing...
java.io.IOException: stream already broken
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:231)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:187)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:222)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:261)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:157)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$closeWriter$5(AsyncFSWAL.java:698)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748) {code}
{code:java}
2022-01-18 12:45:59,051 WARN  [Close-WAL-Writer-4] wal.AsyncProtobufLogWriter - normal close failed, try recover
java.lang.IllegalStateException: should call flush first before calling close
    at org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:510)
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:514)
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:158)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$closeWriter$5(AsyncFSWAL.java:698)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748) {code}
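For context, the IllegalStateException above is a call-ordering precondition: once the trailer write fails with "stream already broken", the output never reaches a flushed state, so the normal close path trips the check and the writer falls back to recovery. A minimal, hypothetical sketch of that precondition pattern (illustrative class and field names only, not HBase's actual FanOutOneBlockAsyncDFSOutput code):
{code:java}
// Hypothetical sketch of a flush-before-close precondition; illustrative only.
import java.io.IOException;

import com.google.common.base.Preconditions;

public class FanOutStreamSketch {
  private enum State { STREAMING, FLUSHED, BROKEN, CLOSED }

  private volatile State state = State.STREAMING;

  /** Called when a datanode connection in the fan-out pipeline dies. */
  void markBroken() {
    state = State.BROKEN;
  }

  /** Waits for all datanode acks; fails fast once the pipeline is broken. */
  public void flush() throws IOException {
    if (state == State.BROKEN) {
      throw new IOException("stream already broken");
    }
    // ... wait for acks from every datanode in the pipeline ...
    state = State.FLUSHED;
  }

  /**
   * Normal close. If the last flush (e.g. the trailer write) failed, the
   * precondition below throws "should call flush first before calling close"
   * and the caller falls back to a recovery-style cleanup instead.
   */
  public void close() {
    Preconditions.checkState(state == State.FLUSHED,
        "should call flush first before calling close");
    state = State.CLOSED;
  }
}
{code}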
{code:java}
2022-01-18 12:46:14,054 WARN  [hbase] wal.AsyncFSWAL - sync failed
java.io.IOException: Timeout(15000ms) waiting for response
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$userEventTriggered$4(FanOutOneBlockAsyncDFSOutput.java:300)
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:233)
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:98)
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.userEventTriggered(FanOutOneBlockAsyncDFSOutput.java:299)
 {code}
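The "Timeout(15000ms) waiting for response" failure is raised from a Netty userEventTriggered callback, per the stack trace above. A minimal, hypothetical Netty sketch of that idle-timeout pattern (the handler and its helper are illustrative, not HBase's actual AckHandler):
{code:java}
// Hypothetical sketch: turning a Netty read-idle event into an ack timeout.
import java.io.IOException;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;

public class AckTimeoutHandlerSketch extends ChannelInboundHandlerAdapter {
  private final long readTimeoutMs;

  public AckTimeoutHandlerSketch(long readTimeoutMs) {
    this.readTimeoutMs = readTimeoutMs;
  }

  @Override
  public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
    // An IdleStateHandler earlier in the pipeline fires this event when no
    // datanode ack has been read for readTimeoutMs.
    if (evt instanceof IdleStateEvent
        && ((IdleStateEvent) evt).state() == IdleState.READER_IDLE) {
      // Fail the output: pending syncs then complete exceptionally and the
      // WAL logs "sync failed" as seen above.
      failPendingWrites(
          new IOException("Timeout(" + readTimeoutMs + "ms) waiting for response"));
      ctx.close();
    } else {
      super.userEventTriggered(ctx, evt);
    }
  }

  private void failPendingWrites(IOException cause) {
    // Placeholder: a real output would mark itself broken and complete every
    // outstanding ack future with this cause.
  }
}
{code}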
Clients are no longer able to continue ingestion due to write failures; this happens only after the masters are upgraded to 2.4.x. Hadoop is running 2.10.1, and no DataNodes or NameNodes were restarted.

Two frequent WARN logs were observed once AsyncFSWAL came into the picture after the HMaster upgrade:
{code:java}
2022-01-18 12:41:58,411 WARN  [ype=LAST_IN_PIPELINE] datanode.DataNode - IOException in BlockReceiver.run(): 
java.io.IOException: Shutting down writer and responder due to a checksum error in received data. The error response has been sent upstream.
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1647)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1565)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1478)
    at java.lang.Thread.run(Thread.java:748) {code}
{code:java}
2022-01-18 11:45:25,779 WARN  [t/disk1/hdfs/current] impl.FsDatasetAsyncDiskService - sync_file_range error
EBADF: Bad file descriptor
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.sync_file_range(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.syncFileRangeIfPossible(NativeIO.java:287)
    at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.syncFileRange(FileIoProvider.java:189)
    at 

[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?

2021-08-26 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405575#comment-17405575
 ] 

Michael Stack edited comment on HBASE-20503 at 8/27/21, 4:47 AM:
-

[~kingWang] -have you looked in your datanode logs? In our case, the datanode 
was restarting unexpectedly – resource constraint.- <= Ignore. Wrong context.


was (Author: stack):
[~kingWang] have you looked in your datanode logs? In our case, the datanode 
was restarting unexpectedly – resource constraint.

> [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL 
> system stuck?
> -
>
> Key: HBASE-20503
> URL: https://issues.apache.org/jira/browse/HBASE-20503
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Reporter: Michael Stack
>Priority: Major
> Attachments: 
> 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch, 
> 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch
>
>
> Scale test. Startup w/ 30k regions over ~250 nodes. This RS is trying to 
> furiously open regions assigned by Master. It is importantly carrying 
> hbase:meta. Twenty minutes in, meta goes dead after an exception up out of 
> AsyncFSWAL. The process had been restarted, so I couldn't get a thread dump. 
> Suspicious is that we archive a WAL and we get an FNFE because we go to access 
> the WAL in its old location. [~Apache9] mind taking a look? Does this FNFE on 
> rolling kill the WAL sub-system? Thanks.
> DFS complaining on file open for a few files getting blocks from remote dead 
> DNs: e.g. {{2018-04-25 10:05:21,506 WARN 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader.
> java.net.ConnectException: Connection refused}}
> AsyncFSWAL complaining: "AbstractFSWAL: Slow sync cost: 103 ms" .
> About ten minutes in, we get this:
> {code}
> 2018-04-25 10:15:16,532 WARN 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
>   at 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>   at 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>   
>   
>   
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:15:16,680 INFO 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL 
> /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta
>  with entries=10819, filesize=7.57 MB; new WAL 
> /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta
> 2018-04-25 10:15:16,680 INFO 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving 
> hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta
>  to 
> hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta
> 2018-04-25 10:15:16,686 WARN 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to 
> write trailer, non-fatal, continuing...
> java.io.IOException: stream already broken
>   at 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>   at 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201)
>   at 
> 

[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?

2021-06-22 Thread Emil Kleszcz (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365637#comment-17365637
 ] 

Emil Kleszcz edited comment on HBASE-20503 at 6/22/21, 11:11 AM:
-

Hi, we experienced the same issue in HBase 2.3.4 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.2.1. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
result after 300000 ms for txid=54056852, WAL system stuck?
(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see the HMaster struggling to update hbase:meta while updating the state of one region.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster (see the sketch after the log line below).
{code:java}
Failed transition b31a3040431b34256e265cd6c5a0c4e6 is not OPEN; 
state=CLOSING>{code}
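For reference, "forcing states for tables/regions" is typically done with the HBCK2 tool from hbase-operator-tools, which drives the server-side Hbck API. Below is a minimal, hypothetical sketch of re-queuing an assign through that API, assuming a 2.x client where Connection.getHbck() is available; the encoded region name is just the placeholder taken from the log line above:
{code:java}
// Hypothetical sketch of scheduling an assign for a stuck region via the Hbck
// API (the same API the HBCK2 "assigns" command drives); illustrative only.
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Hbck;

public class ForceAssignSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
        ConnectionFactory.createConnection(HBaseConfiguration.create())) {
      Hbck hbck = conn.getHbck();
      // Re-queue an assign for a region stuck in OPENING/CLOSING; returns the
      // procedure ids scheduled on the active master.
      List<Long> pids =
          hbck.assigns(Collections.singletonList("b31a3040431b34256e265cd6c5a0c4e6"));
      System.out.println("Scheduled assign procedures: " + pids);
    }
  }
}
{code}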
I hope we have some patch for this soon.


was (Author: tr0k):
Hi, we experienced the same issue in HBase 2.3.4 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.2.1. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
result after 300000 ms for txid=54056852, WAL system stuck?
(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope we have some patch for this soon.


[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?

2021-06-19 Thread Emil Kleszcz (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365637#comment-17365637
 ] 

Emil Kleszcz edited comment on HBASE-20503 at 6/19/21, 1:32 PM:


Hi, we experienced the same issue in HBase 2.3.4 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.2.1. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
result after 300000 ms for txid=54056852, WAL system stuck?
(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope we have some patch for this soon.


was (Author: tr0k):
Hi, we experienced the same issue in HBase 2.3.4 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.2.1. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
result after 300000 ms for txid=54056852, WAL system stuck?
(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope you apply the working patch soon.


[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?

2021-06-18 Thread Emil Kleszcz (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365637#comment-17365637
 ] 

Emil Kleszcz edited comment on HBASE-20503 at 6/18/21, 6:05 PM:


Hi, we experienced the same issue in HBase 2.3.4 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.2.1. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
result after 300000 ms for txid=54056852, WAL system stuck?
(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope you apply the working patch soon.


was (Author: tr0k):
Hi, we experienced the same issue in HBase 2.3.1 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.1.2. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
result after 300000 ms for txid=54056852, WAL system stuck?
(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope you apply the working patch soon.


[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?

2021-06-18 Thread Emil Kleszcz (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365637#comment-17365637
 ] 

Emil Kleszcz edited comment on HBASE-20503 at 6/18/21, 6:04 PM:


Hi, we experienced the same issue in HBase 2.3.1 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.1.2. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
result after 300000 ms for txid=54056852, WAL system stuck?
(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope you apply the working patch soon.


was (Author: tr0k):
Hi, we experienced the same issue in HBase 2.3.1 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.1.2. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
result after 300000 ms for txid=54056852, WAL system stuck?(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope you apply the working patch soon.


[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?

2021-06-18 Thread Emil Kleszcz (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365637#comment-17365637
 ] 

Emil Kleszcz edited comment on HBASE-20503 at 6/18/21, 6:02 PM:


Hi, we experienced the same issue in HBase 2.3.1 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.1.2. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
result after 300000 ms for txid=54056852, WAL system stuck?(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope you apply the working patch soon.


was (Author: tr0k):
Hi, we experienced the same issue in HBase 2.3.1 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.1.2. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope you apply the working patch soon.


[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?

2021-06-18 Thread Emil Kleszcz (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365637#comment-17365637
 ] 

Emil Kleszcz edited comment on HBASE-20503 at 6/18/21, 5:38 PM:


Hi, we experienced the same issue in HBase 2.3.1 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.1.2. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope you apply the working patch soon.


was (Author: tr0k):
Hi, we experienced the same issue in HBase 2.3.1 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, on which we never observed this problem. We run on HDP 3.1.2. On average we have around 800 regions per RS, and the workload was as usual these days.

This problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200>  :   : 
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200>  : 
java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200>  
:   : 
java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200>  : 
(...)
<2021-06-16T18:43:05.220+0200>  : 

{code}
Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down.
This triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.862+0200>  : 

<2021-06-17T08:53:13.866+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 

<2021-06-17T08:53:13.867+0200>  : 
 
<2021-06-17T08:53:28.443+0200>  : {code}
In HA, the HMasters started flipping over, and we could observe more and more RITs in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing states for tables/regions) helped to recover the cluster.

I hope you apply the working patch soon.
