Il gio 8 mar 2018, 20:52 Sijie Guo <guosi...@gmail.com> ha scritto: > On Thu, Mar 8, 2018 at 7:42 AM, Enrico Olivelli <eolive...@gmail.com> > wrote: > >> >> >> 2018-03-08 14:50 GMT+01:00 Ivan Kelly <iv...@apache.org>: >> >>> It just occurred to me that this could be a problem with the recycler. >>> If we recycle a buffer too early, but then keep using it, another user >>> could pick it up, and between them they could corrupt the data that >>> would cause it to be unhandled. >>> >> >> >> Maybe current work from Sijie, Andrey and JV which Sijie is backporting >> to branch-4.6 could resolve the issue >> > > > I think that is unrelated. The change Andrey, JV and me dealt with is the > PendingAddOp. It is on client side not bookie side. > > >> >> >> >> >> >>> >>> -Ivan >>> >>> On Thu, Mar 8, 2018 at 1:49 PM, Enrico Olivelli <eolive...@gmail.com> >>> wrote: >>> > >>> > >>> > 2018-03-08 11:57 GMT+01:00 Ivan Kelly <iv...@apache.org>: >>> >> >>> >> Hmm, yes, looks messed up. Do you have a reliable repro of this? >>> >> >>> >> "at io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMessage" >>> >> <- it seems the message itself wasn't handled. Do you have any more >>> >> bookie side logs leading up to this event? >>> > >>> > >>> > >>> > The only suspicious logs are for logger named >>> > "org.apache.bookkeeper.bookie.EntryLogger" >>> > >>> > >>> > 18-03-08-09-22-40 Created new entry log file >>> > /data/qasql/bookie/./bookie_data/current/4b0.log for logId 1200. >>> > 18-03-08-09-32-41 Failed to get ledgers map index from: 53.log : >>> No >>> > ledgers map index found on entryLogId53 >>> > 18-03-08-09-33-10 Failed to get ledgers map index from: 54.log : >>> No >>> > ledgers map index found on entryLogId54 >>> > 18-03-08-09-33-10 Failed to get ledgers map index from: 375.log >>> : No >>> > ledgers map index found on entryLogId375 >>> > 18-03-08-09-33-13 Failed to get ledgers map index from: 878.log >>> : No >>> > ledgers map index found on entryLogId878 >>> > 18-03-08-09-33-22 Failed to get ledgers map index from: 1197.log >>> : No >>> > ledgers map index found on entryLogId1197 >>> > 18-03-08-09-33-57 Failed to get ledgers map index from: 1198.log >>> : No >>> > ledgers map index found on entryLogId1198 >>> > 18-03-08-09-34-54 Failed to get ledgers map index from: 1199.log >>> : No >>> > ledgers map index found on entryLogId1199 >>> >> > > This is also unrelated. It is a fine logging, since it means it attempts > to read ledgers map index, but didn't find one. > > the ledgers map index is only written when it is enabled at the bookies. > > >> > >>> > I will try to run Bookie with level debug, but it produces lots of spam >>> > >>> > Thanks >>> > Enrico >>> > >>> >> >>> >> >>> >> -Ivan >>> >> >>> >> On Thu, Mar 8, 2018 at 9:18 AM, Enrico Olivelli <eolive...@gmail.com> >>> >> wrote: >>> >> > Hi all, >>> >> > I am seeing this bad errors in some test environments with 4.6.1 >>> >> > The errors appear during rolling restart of the application, the >>> test >>> >> > env is >>> >> > made of 6 machines: >>> >> > - 3 bookies >>> >> > - 3 client machines (with several BK clients, of different >>> sub-systems) >>> >> > - running with 4.6.1 both client and servers >>> >> > >>> >> > *I do not have* reports of this errors from production, already >>> running >>> >> > 4.6.1 for the last month >>> >> > >>> >> > But the problem is quite scary >>> >> > >>> >> > This is a sample of relevant errors on clients (in this case >>> Majordodo >>> >> > brokers, with log level = INFO) >>> >> > >>> >> > logs/org.apache.bookkeeper.client.PendingAddOp: >>> >> > 18-03-07-09-51-43 Write of ledger entry to quorum failed: >>> L366634 >>> >> > E2557 >>> >> > 18-03-07-09-51-43 Write of ledger entry to quorum failed: >>> L366634 >>> >> > E2558 >>> >> > 18-03-07-09-51-43 Write of ledger entry to quorum failed: >>> L366634 >>> >> > E2559 >>> >> > 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie >>> >> > operation timeout >>> >> > 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie >>> >> > operation timeout >>> >> > 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie >>> >> > operation timeout >>> >> > 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie >>> >> > operation timeout >>> >> > 18-03-07-16-00-05 Failed to write entry (366680, 1865): Bookie >>> >> > operation timeout >>> >> > >>> >> > >>> >> > org.apache.bookkeeper.proto.PerChannelBookieClient: >>> >> > >>> >> > 18-03-07-10-06-44 Unexpected exception caught by bookie client >>> >> > channel >>> >> > handler >>> >> > 18-03-07-10-06-44 io.netty.handler.codec.DecoderException: >>> >> > java.lang.IllegalStateException: Received unknown response : op >>> code = 6 >>> >> > io.netty.handler.codec.DecoderException: >>> >> > java.lang.IllegalStateException: >>> >> > Received unknown response : op code = 6 >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:98) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1414) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:945) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:404) >>> >> > at >>> >> > io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304) >>> >> > at >>> >> > >>> >> > >>> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886) >>> >> > at java.base/java.lang.Thread.run(Thread.java:844) >>> >> > Caused by: java.lang.IllegalStateException: Received unknown >>> response : >>> >> > op >>> >> > code = 6 >>> >> > at >>> >> > >>> >> > org.apache.bookkeeper.proto.Bo >>> okieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294) >>> >> > at >>> >> > >>> >> > org.apache.bookkeeper.proto.Bo >>> okieProtoEncoding$ResponseDecoder.decode(BookieProtoEncoding.java:478) >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88) >>> >> > >>> >> > >>> >> > ... 19 more >>> >> > >>> >> > >>> >> > Errors on Bookie side, with log level = INFO) >>> >> > >>> >> > >>> >> > io.netty.channel.DefaultChannelPipeline: >>> >> > 8-03-07-09-51-38 An exceptionCaught() event was fired, and it >>> >> > reached >>> >> > at the tail of the pipeline. It usually means the last handler in >>> the >>> >> > pipeline did not handle the exception. >>> >> > 18-03-07-09-51-38 >>> io.netty.util.IllegalReferenceCountException: >>> >> > refCnt: 0, increment: 1 >>> >> > io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: >>> 1 >>> >> > at >>> >> > >>> >> > >>> io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:100) >>> >> > at >>> >> > >>> >> > >>> io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84) >>> >> > at >>> >> > io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil.java:88) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMessage(DefaultChannelPipeline.java:1202) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.DefaultChannelPipeline$TailContext.channelRead(DefaultChannelPipeline.java:1299) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) >>> >> > at >>> >> > >>> >> > org.apache.bookkeeper.proto.Bo >>> okieRequestHandler.channelRead(BookieRequestHandler.java:78) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) >>> >> > at >>> >> > >>> >> > org.apache.bookkeeper.proto.Au >>> thHandler$ServerSideHandler.channelRead(AuthHandler.java:88) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) >>> >> > at >>> >> > >>> >> > >>> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1414) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:945) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) >>> >> > at >>> >> > >>> >> > >>> io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:404) >>> >> > at >>> >> > io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304) >>> >> > at >>> >> > >>> >> > >>> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886) >>> >> > at java.base/java.lang.Thread.run(Thread.java:844) >>> >> > questions: > > 1) which protocol version are you using? >
V3 > 2) is the exception coming from read response or write response? it sound > like write response. If it is write response, it would be surprised about > the ref count. because we don't send any buffer back in write response. > There is no clue of read error, only those logs about write failures Thank you for support Enrico > > >> >> > >>> >> > >>> > >>> > >>> >> >> > -- -- Enrico Olivelli