[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
[ https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477853#comment-17477853 ]

Viraj Jasani edited comment on HBASE-20503 at 1/18/22, 1:28 PM:

Upgrading HBase 1.6.0 to 2.4.6/7, we are facing a similar issue:

{code:java}
2022-01-18 12:45:59,051 WARN [Close-WAL-Writer-4] wal.AbstractProtobufLogWriter - Failed to write trailer, non-fatal, continuing...
java.io.IOException: stream already broken
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:231)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:187)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:222)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:261)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:157)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$closeWriter$5(AsyncFSWAL.java:698)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
{code:java}
2022-01-18 12:45:59,051 WARN [Close-WAL-Writer-4] wal.AsyncProtobufLogWriter - normal close failed, try recover
java.lang.IllegalStateException: should call flush first before calling close
        at org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:510)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:514)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:158)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$closeWriter$5(AsyncFSWAL.java:698)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
{code:java}
2022-01-18 12:46:14,054 WARN [hbase] wal.AsyncFSWAL - sync failed
java.io.IOException: Timeout(15000ms) waiting for response
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$userEventTriggered$4(FanOutOneBlockAsyncDFSOutput.java:300)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:233)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:98)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.userEventTriggered(FanOutOneBlockAsyncDFSOutput.java:299)
{code}

Clients are no longer able to continue ingestion because of the write failures. This happens only after the masters are upgraded to 2.4.x. Hadoop is running 2.10.1, and no datanodes or namenodes were restarted.
Two frequent WARN logs were observed on the DataNode side once AsyncFSWAL came into the picture after the HMaster upgrade:

{code:java}
2022-01-18 12:41:58,411 WARN [ype=LAST_IN_PIPELINE] datanode.DataNode - IOException in BlockReceiver.run():
java.io.IOException: Shutting down writer and responder due to a checksum error in received data. The error response has been sent upstream.
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1647)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1565)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1478)
        at java.lang.Thread.run(Thread.java:748)
{code}
{code:java}
2022-01-18 11:45:25,779 WARN [t/disk1/hdfs/current] impl.FsDatasetAsyncDiskService - sync_file_range error
EBADF: Bad file descriptor
        at org.apache.hadoop.io.nativeio.NativeIO$POSIX.sync_file_range(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$POSIX.syncFileRangeIfPossible(NativeIO.java:287)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.syncFileRange(FileIoProvider.java:189)
        at
{code}
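For clusters hitting this after moving to the async WAL, one mitigation operators commonly try while the root cause is chased down is to fall back to the classic FSHLog-based WAL provider. A minimal hbase-site.xml sketch of that fallback, using the documented {{hbase.wal.provider}} / {{hbase.wal.meta_provider}} properties (this is a workaround, not a fix for this issue, and takes effect only after a RegionServer restart):

{code:xml}
<!-- Sketch: switch from AsyncFSWAL back to the FSHLog-based provider. -->
<property>
  <name>hbase.wal.provider</name>
  <value>filesystem</value>
</property>
<property>
  <!-- hbase:meta has its own WAL provider setting; keep it consistent. -->
  <name>hbase.wal.meta_provider</name>
  <value>filesystem</value>
</property>
{code}

Whether this is acceptable depends on the deployment: it trades the async fan-out write path for the classic DFSClient write path, at some cost in sync latency.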
[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
[ https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405575#comment-17405575 ]

Michael Stack edited comment on HBASE-20503 at 8/27/21, 4:47 AM:

-[~kingWang] have you looked in your datanode logs? In our case, the datanode was restarting unexpectedly – resource constraint.- <= Ignore. Wrong context.

was (Author: stack):
[~kingWang] have you looked in your datanode logs? In our case, the datanode was restarting unexpectedly – resource constraint.

> [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-20503
>                 URL: https://issues.apache.org/jira/browse/HBASE-20503
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: Michael Stack
>            Priority: Major
>         Attachments: 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch, 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch
>
>
> Scale test. Startup w/ 30k regions over ~250 nodes. This RS is trying to furiously open regions assigned by Master. It is, importantly, carrying hbase:meta. Twenty minutes in, meta goes dead after an exception up out of AsyncFSWAL. The process had been restarted, so I couldn't get a thread dump.
> Suspicious is that we archive a WAL and then get an FNFE because we go to access the WAL in its old location. [~Apache9] mind taking a look? Does this FNFE on rolling kill the WAL sub-system? Thanks.
> DFS is complaining on file open for a few files, getting blocks from remote dead DNs, e.g. {{2018-04-25 10:05:21,506 WARN org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing remote block reader. java.net.ConnectException: Connection refused}}
> AsyncFSWAL complaining: "AbstractFSWAL: Slow sync cost: 103 ms".
> About ten minutes in, we get this:
> {code}
> 2018-04-25 10:15:16,532 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
>         at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>         at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:15:16,680 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta with entries=10819, filesize=7.57 MB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta
> 2018-04-25 10:15:16,680 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta
> 2018-04-25 10:15:16,686 WARN org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
> java.io.IOException: stream already broken
>         at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>         at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201)
>         at
> {code}
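Stack's suggestion above (look at the datanode logs first) can be turned into a rough triage pass over the logs. A shell sketch, assuming typical log locations; the paths are illustrative, not taken from this issue, and the grep patterns are simply the WARN messages quoted in this thread:

{code:bash}
#!/usr/bin/env bash
# Correlate the RegionServer-side WAL failures with DataNode-side symptoms.
# Adjust these paths to the actual log locations on the affected hosts.
RS_LOG=/var/log/hbase/hbase-regionserver.log
DN_LOG=/var/log/hadoop-hdfs/hadoop-datanode.log

# 1. RegionServer: when did the WAL pipeline first break or slow down?
grep -nE "sync failed|stream already broken|Failed to get sync result|Slow sync cost" "$RS_LOG" | head

# 2. DataNode: restarts, checksum errors, or fsync trouble around those timestamps?
grep -nE "STARTUP_MSG|checksum error|sync_file_range|Shutting down writer" "$DN_LOG" | head
{code}

If the DataNode log shows restarts or disk errors just before the RegionServer's first "sync failed", the WAL stall is likely a symptom of the DataNode problem rather than the root cause.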
[jira] [Comment Edited] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
[ https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365637#comment-17365637 ]

Emil Kleszcz edited comment on HBASE-20503 at 6/22/21, 11:11 AM:

Hi, we experienced the same issue in HBase 2.3.4 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, where we never observed this problem. We run on HDP 3.2.1. On average we have around 800 regions per RS, and the workload these days was as usual. The problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:

{code:java}
<2021-06-15T10:31:28.284+0200> : : java.io.IOException: stream already broken
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200> : java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200> : : java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200> : org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=54056852, WAL system stuck?
(...)
<2021-06-16T18:43:05.220+0200> :
{code}

Since then, compaction started failing on many regions, including meta. Two days later we could see the HMaster struggling to update hbase:meta while updating the state of one region... This triggered an avalanche of stuck procedures in the HMaster:

{code:java}
<2021-06-17T08:53:13.862+0200> :
<2021-06-17T08:53:13.862+0200> :
<2021-06-17T08:53:13.866+0200> :
<2021-06-17T08:53:13.867+0200> :
<2021-06-17T08:53:13.867+0200> :
<2021-06-17T08:53:28.443+0200> :
{code}

In HA, the HMasters started flipping over, and we could observe more and more RITs with OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only the manual fix (forcing states for tables/regions) helped to recover the cluster.

{code:java}
Failed transition b31a3040431b34256e265cd6c5a0c4e6 is not OPEN; state=CLOSING>
{code}

I hope we have some patch for this soon.

was (Author: tr0k):
Hi, we experienced the same issue in HBase 2.3.4 on one of our production clusters this week. This happened a few weeks after upgrading HBase from 2.2.4, where we never observed this problem. We run on HDP 3.2.1. On average we have around 800 regions per RS, and the workload these days was as usual. The problem started on one of the RSs where the meta region was residing. We could observe the following in the RS log:

{code:java}
<2021-06-15T10:31:28.284+0200> : : java.io.IOException: stream already broken
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200> : java.io.FileNotFoundException: File does not exist: /hbase/WALs/
(...)
<2021-06-15T11:15:59.241+0200> : : java.io.IOException: stream already broken
(...)
<2021-06-15T11:39:39.986+0200> : org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=54056852, WAL system stuck?
(...)
<2021-06-16T18:43:05.220+0200> :
{code}

Since then, compaction started failing on many regions, including meta. Two days later we could see one RS going down... This triggered an avalanche of stuck procedures in the HMaster:

{code:java}
<2021-06-17T08:53:13.862+0200> :
<2021-06-17T08:53:13.862+0200> :
<2021-06-17T08:53:13.866+0200> :
<2021-06-17T08:53:13.867+0200> :
<2021-06-17T08:53:13.867+0200> :
<2021-06-17T08:53:28.443+0200> :
{code}

In HA, the HMasters started flipping over, and we could observe more and more RITs with OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only the manual fix (forcing states for tables/regions) helped to recover the cluster. I hope we have some patch for this soon.
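For the "manual fix (forcing states for tables/regions)" mentioned above, the usual tool is HBCK2 from hbase-operator-tools. A sketch of the kind of commands involved, run from a node with the HBase client configured; the jar path/version and table name are placeholders, and the region hash is simply the one from the "Failed transition" log line above:

{code:bash}
# HBCK2 ships in hbase-operator-tools; the jar location below is a placeholder.
HBCK2_JAR=/opt/hbase-operator-tools/hbase-hbck2-1.2.0.jar

# Force a region wedged in CLOSING/OPENING back to a known state,
# then ask the master to assign it again.
hbase hbck -j "$HBCK2_JAR" setRegionState b31a3040431b34256e265cd6c5a0c4e6 CLOSED
hbase hbck -j "$HBCK2_JAR" assigns b31a3040431b34256e265cd6c5a0c4e6

# If a whole table's state in hbase:meta is wrong, force it as well.
hbase hbck -j "$HBCK2_JAR" setTableState my_table ENABLED
{code}

Forcing state is a last-resort recovery step, as the comment implies: it only makes sense once the WAL/meta writes that got stuck have actually been resolved.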