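2.5.9 has a serious bug, so I suggest waiting for 2.5.10, which will be released very soon.

2.6.0 does not differ much from 2.5.x. Only the replication code went through a fairly large refactoring, and I am not sure whether some unusual case is left uncovered.
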
sudo rm -rf /* <2326130...@qq.com.invalid> wrote on Mon, Jul 22, 2024 at 22:19:
>
> Yes, the replication queue has backed up to more than 800. It should not be the last file; it is a file somewhere in the middle. The stuck WAL
> also appears to be an empty file, which makes the WAL read fail. The same exception is reported over and over and it does not recover by itself.
> The relevant code snippet:
>   @Override
>   public Entry next(Entry reuse) throws IOException {
>     long originalPosition = getPosition();
>     if (reachWALEditsStopOffset(originalPosition)) {
>       return null;
>     }
>     WALProtos.WALKey walKey;
>     try {
>       // for one way stream reader, we do not care about what is the exact position where we hit the
>       // EOF or IOE, so just use the helper method to parse WALKey, in tailing reader, we will try
>       // to read the varint size by ourselves
>       walKey = ProtobufUtil.parseDelimitedFrom(inputStream, WALProtos.WALKey.parser());
>     } catch (InvalidProtocolBufferException e) {
>       if (ProtobufUtil.isEOF(e) || isWALTrailer(originalPosition)) {
>         // InvalidProtocolBufferException.truncatedMessage, should throw EOF
>         // or we have started to read the partial WALTrailer
>         throw (EOFException) new EOFException("EOF while reading WALKey, originalPosition="
>           + originalPosition + ", currentPosition=" + inputStream.getPos()).initCause(e);
>       } else {
>         // For all other type of IPBEs, it means the WAL key is broken, throw IOException out to let
>         // the upper layer know, unless we have already reached the partial WALTrailer
>         throw (IOException) new IOException("Error while reading WALKey, originalPosition="
>           + originalPosition + ", currentPosition=" + inputStream.getPos()).initCause(e);
>       }
>     }
>
> This is from the ProtobufWALStreamReader class. ProtobufWALStreamReader should be new in 2.6.0; earlier 2.x versions did not have this class.
> Mr. Zhang, would you recommend 2.5.9 or 2.6.0 for a production environment?
>
> ------------------ Original Message ------------------
> From: "user-zh" <palomino...@gmail.com>;
> Sent: Monday, July 22, 2024, 10:11 PM
> To: "user-zh"<user-zh@hbase.apache.org>;
> Subject: Re: hbase2.6.0 replicationSource WALReader exception while reading WAL
>
> Is it still reporting the same problem? Then there is probably a bug where the reader implementation is switched. It should be using
> ProtobufWALStreamReader and should no longer be using the TailingReader.
>
> sudo rm -rf /* <2326130...@qq.com.invalid> wrote on Mon, Jul 22, 2024 at 22:06:
> >
> > I will try an attachment, or send the screenshot tomorrow. I am at home and cannot log in to Google mail. It is not the last file; the replication queue has already backed up to more than 800.
> >
> > ------------------ Original Message ------------------
> > From: "user-zh" <palomino...@gmail.com>;
> > Sent: Monday, July 22, 2024, 9:56 PM
> > To: "user-zh"<user-zh@hbase.apache.org>;
> > Subject: Re: hbase2.6.0 replicationSource WALReader exception while reading WAL
> >
> > The screenshot seems to be broken; I cannot see it...
> >
> > If it is still reading with the tailing reader, that means this is the last file, and the tailing reader will not skip an empty file.
> >
> > If there is already a newer WAL file, it should no longer be reading with the tailing reader. If it hits EOF at that point, there is logic to skip the file directly.
> >
> > Is the file the tailing reader keeps reading really the last file? Or is it in fact no longer the last file, yet still being read with the tailing reader?
> >
> > sudo rm -rf /* <2326130...@qq.com.invalid> wrote on Mon, Jul 22, 2024 at 21:41:
> >
> > > Mr. Zhang,
> > > Hello, and thank you for your reply. Replication is stuck. I picked one RS node; its replication status is shown in the screenshot below:
> > >
> > > The first file in the screenshot has the form: hdfs://coreHBaseProdHa/hbase/WALs/sh2-int-hbase-main-ha-2,16020,1720603345541/sh2-int-hbase-main-ha-2%2C16020%2C1720603345541.1720606991648
> > > The first file no longer exists.
> > > The second and third files point into the oldWals directory and do exist. Reading them with hbase wal -p fails as follows:
> > > Writer Classes: ProtobufLogWriter AsyncProtobufLogWriter SecureProtobufLogWriter SecureAsyncProtobufLogWriter
> > > Cell Codec Class: org.apache.hadoop.hbase.regionserver.wal.WALCellCodec
> > > Exception in thread "main" java.io.EOFException: EOF while reading message size
> > >     at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.parseDelimitedFrom(ProtobufUtil.java:3727)
> > >     at org.apache.hadoop.hbase.regionserver.wal.ProtobufWALStreamReader.next(ProtobufWALStreamReader.java:56)
> > >     at org.apache.hadoop.hbase.wal.WALStreamReader.next(WALStreamReader.java:42)
> > >     at org.apache.hadoop.hbase.wal.WALPrettyPrinter.processFile(WALPrettyPrinter.java:297)
> > >     at org.apache.hadoop.hbase.wal.WALPrettyPrinter.run(WALPrettyPrinter.java:516)
> > >     at org.apache.hadoop.hbase.wal.WALPrettyPrinter.main(WALPrettyPrinter.java:429)
> > > This looks like the error produced by reading an empty file.
> > > For other, normal WAL files, the hbase wal -p command runs fine and can parse the WAL contents. Please take another look when you have time. Thank you very much.
> > >
> > > ------------------ Original Message ------------------
> > > *From:* "user-zh" <palomino...@gmail.com>;
> > > *Sent:* Monday, July 22, 2024, 8:56 PM
> > > *To:* "user-zh"<user-zh@hbase.apache.org>;
> > > *Subject:* Re: hbase2.6.0 replicationSource WALReader exception while reading WAL
> > >
> > > Is replication stuck? The stream reader keeps tailing the file, so if it hits a half-written entry it may well throw an exception, and it will retry.
> > > If it is not stuck and can keep reading afterwards, there is no problem.
> > >
> > > You could also try reading that file with WALPrettyPrinter to see whether it can be read.
> > >
> > > leojie <leo...@apache.org> wrote on Mon, Jul 22, 2024 at 18:03:
> > >
> > > > Mr. Zhang,
> > > > Hello, I have a question. I have recently been testing hbase2.6.0. With replication enabled, the replication Source thread currently uses the
> > > > ProtobufWALStreamReader class (new in 2.6.0) to read and parse WAL files, and it hits the following exception:
> > > > InvalidProtocolBufferException$InvalidWireTypeException: Protocol message tag had invalid wire type.
> > > > I read the source code but did not really understand it, since it involves the underlying Protocol serialization. Could this be caused by
> > > > writing data through a lower-version hbase-client API (for example hbase2.2.7)?
> > > > My environment: hadoop3.3.6 hbase2.6.0
> > > > The full exception stack trace:
> > > > 2024-07-22T17:47:49,130 WARN [RS_CLAIM_REPLICATION_QUEUE-regionserver/sh2-int-hbase-main-ha-9:16020-0.replicationSource,test_hbase_258-tx1-int-hbase-main-prod-3,16020,1720602522464.replicationSource.wal-reader.tx1-int-hbase-main-prod-3%2C16020%2C1720602522464,test_hbase_258-tx1-int-hbase-main-prod-3,16020,1720602522464] wal.ProtobufWALStreamReader: Error while reading WALKey, originalPosition=0, currentPosition=81
> > > > org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException$InvalidWireTypeException: Protocol message tag had invalid wire type.
> > > >     at org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:119) ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
> > > >     at org.apache.hbase.thirdparty.com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:503) ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
> > > >     at org.apache.hbase.thirdparty.com.google.protobuf.GeneratedMessage$Builder.parseUnknownField(GeneratedMessage.java:770) ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
> > > >     at org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALKey$Builder.mergeFrom(WALProtos.java:2829) ~[hbase-protocol-shaded-2.6.0.jar:2.6.0]
> > > >     at org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALKey$1.parsePartialFrom(WALProtos.java:4212) ~[hbase-protocol-shaded-2.6.0.jar:2.6.0]
> > > >     at org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALKey$1.parsePartialFrom(WALProtos.java:4204) ~[hbase-protocol-shaded-2.6.0.jar:2.6.0]
> > > >     at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:192) ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
> > > >     at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:209) ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
> > > >     at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:214) ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
> > > >     at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:25) ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
> > > >     at org.apache.hbase.thirdparty.com.google.protobuf.GeneratedMessage.parseWithIOException(GeneratedMessage.java:321) ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
> > > >     at org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALKey.parseFrom(WALProtos.java:2321) ~[hbase-protocol-shaded-2.6.0.jar:2.6.0]
> > > >     at org.apache.hadoop.hbase.regionserver.wal.ProtobufWALTailingReader.readWALKey(ProtobufWALTailingReader.java:128) ~[hbase-server-2.6.0.jar:2.6.0]
> > > >     at org.apache.hadoop.hbase.regionserver.wal.ProtobufWALTailingReader.next(ProtobufWALTailingReader.java:257) ~[hbase-server-2.6.0.jar:2.6.0]
> > > >     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:490) ~[hbase-server-2.6.0.jar:2.6.0]
> > > >     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.lastAttempt(WALEntryStream.java:306) ~[hbase-server-2.6.0.jar:2.6.0]
> > > >     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:388) ~[hbase-server-2.6.0.jar:2.6.0]
> > > >     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:130) ~[hbase-server-2.6.0.jar:2.6.0]
> > > >     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:153) ~[hbase-server-2.6.0.jar:2.6.0]
> > > > 2024-07-22T17:48:13,315 WARN [RS-EventLoopGroup-1-65] ipc.NettyRpcConnection: Exception encountered while connecting to the server tx1-int-hbase-main-prod-3:16020
> > > > org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException: connection timed out after 10000 ms: tx1-int-hbase-main-prod-3/127.0.0.1:16020
> > > >     at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:615) ~[hbase-shaded-netty-4.1.7.jar:?]
> > > >     at org.apache.hbase.thirdparty.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[hbase-shaded-netty-4.1.7.jar:?]
> > > >     at org.apache.hbase.thirdparty.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) ~[hbase-shaded-netty-4.1.7.jar:?]
> > > >     at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[hbase-shaded-netty-4.1.7.jar:?]
> > > >     at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[hbase-shaded-netty-4.1.7.jar:?]
> > > >     at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[hbase-shaded-netty-4.1.7.jar:?]
> > > >     at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416) ~[hbase-shaded-netty-4.1.7.jar:?]
> > > >     at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[hbase-shaded-netty-4.1.7.jar:?]
> > > >     at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[hbase-shaded-netty-4.1.7.jar:?]
> > > >     at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[hbase-shaded-netty-4.1.7.jar:?]
> > > >     at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
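
A note on the "EOF while reading message size" error quoted in this thread: a length-delimited protobuf record starts with a varint size prefix, so a reader sees EOF at that step whenever the stream ends before the prefix is complete, for instance when there are no entries left to read. The sketch below only illustrates that first step in plain Java. It is not the HBase or protobuf implementation; the class and method names (DelimitedSizeSketch, readVarintSize) are hypothetical, and it assumes a 32-bit size encoded in at most five bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not code from HBase or protobuf.
public class DelimitedSizeSketch {

  /**
   * Reads the varint size prefix of a length-delimited record.
   * Throws EOFException if the stream ends before the prefix is complete,
   * and a plain IOException if the prefix is malformed.
   */
  public static int readVarintSize(InputStream in) throws IOException {
    int result = 0;
    // Each byte carries 7 payload bits; the high bit marks "more bytes follow".
    for (int shift = 0; shift < 32; shift += 7) {
      int b = in.read();
      if (b < 0) {
        // Nothing (or not enough) left to read: report EOF, not corruption.
        throw new EOFException("EOF while reading message size");
      }
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
    }
    // Assumption of this sketch: sizes longer than five bytes are rejected.
    throw new IOException("Malformed varint size prefix");
  }

  /** Classifies the start of a buffer the way a caller might log it. */
  public static String describe(byte[] data) {
    try {
      return "size=" + readVarintSize(new ByteArrayInputStream(data));
    } catch (EOFException e) {
      return "EOF";
    } catch (IOException e) {
      return "corrupt";
    }
  }

  public static void main(String[] args) {
    System.out.println(describe(new byte[0]));                  // no bytes at all
    System.out.println(describe(new byte[] {(byte) 0xAC, 2}));  // two-byte varint
    System.out.println(describe(new byte[] {(byte) 0x80}));     // prefix cut short
  }
}
```

Keeping EOF separate from other IOExceptions matches the distinction the quoted next() method makes: EOF may simply mean "nothing more to read here", while any other parse failure means the data itself is broken.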