On Thu, Mar 8, 2018 at 1:05 PM, Sijie Guo <guosi...@gmail.com> wrote:

>
> On Thu, Mar 8, 2018 at 12:33 PM Andrey Yegorov <andrey.yego...@gmail.com>
> wrote:
>
>> I am looking more at the PendigAddOp and it looks like, in addition to
>> the case that Sijie has fixed, there is another scenario where recycler can
>> get triggered.
>>
>> I.e.:
>> first sendWriteRequest() in PendingAddOp.safeRun() fails in PCBC's
>> writeAndFlush (channel == null or channel.writeAndFlush throws)
>> hasRun is set to true at that point. pendingWriteRequests is still zero
>> (not incremented yet)
>> PCBC's errorOut triggers callback that is sure that since
>> pendingWriteRequests==0 and hasRun it can release PendingAddOp back to
>> recycler
>>
>> now there are still sendWriteRequest's to follow and all bets are off.
>> i.e. by the time they are executed the addOp can get reused so they end up
>> sending another request to the wrong bookie.
>>
>> So far my idea of solution for that is to run PCBC's errorOut
>> asynchronously / submit it back to orderedSafeExecutor so addOp can
>> increment counts properly.
>>
>
> Oh PendingAddOp should be executed at safe executor. Otherwise it is not
> just recycle problem, it might be a correctness issue.
>

This is a good solution and I believe lot of areas assumes this order. In
my original fix I moved hasRun to the end of the method, intention was
correct
(to set this only at the end of run) but the way I did was not right, hence
moved it back to the top of the method.

JV


>
>
>>
>>
>> ----------
>> Andrey Yegorov
>>
>> On Thu, Mar 8, 2018 at 11:52 AM, Sijie Guo <guosi...@gmail.com> wrote:
>>
>>>
>>>
>>> On Thu, Mar 8, 2018 at 7:42 AM, Enrico Olivelli <eolive...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> 2018-03-08 14:50 GMT+01:00 Ivan Kelly <iv...@apache.org>:
>>>>
>>>>> It just occurred to me that this could be a problem with the recycler.
>>>>> If we recycle a buffer too early, but then keep using it, another user
>>>>> could pick it up, and between them they could corrupt the data that
>>>>> would cause it to be unhandled.
>>>>>
>>>>
>>>>
>>>> Maybe current work from Sijie, Andrey and JV which Sijie is backporting
>>>> to branch-4.6 could resolve the issue
>>>>
>>>
>>>
>>> I think that is unrelated. The change Andrey, JV and me dealt with is
>>> the PendingAddOp. It is on client side not bookie side.
>>>
>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> -Ivan
>>>>>
>>>>> On Thu, Mar 8, 2018 at 1:49 PM, Enrico Olivelli <eolive...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> >
>>>>> > 2018-03-08 11:57 GMT+01:00 Ivan Kelly <iv...@apache.org>:
>>>>> >>
>>>>> >> Hmm, yes, looks messed up. Do you have a reliable repro of this?
>>>>> >>
>>>>> >> "at io.netty.channel.DefaultChannelPipeline.
>>>>> onUnhandledInboundMessage"
>>>>> >> <- it seems the message itself wasn't handled. Do you have any more
>>>>> >> bookie side logs leading up to this event?
>>>>> >
>>>>> >
>>>>> >
>>>>> > The only suspicious logs are for logger named
>>>>> > "org.apache.bookkeeper.bookie.EntryLogger"
>>>>> >
>>>>> >
>>>>> > 18-03-08-09-22-40       Created new entry log file
>>>>> > /data/qasql/bookie/./bookie_data/current/4b0.log for logId 1200.
>>>>> > 18-03-08-09-32-41       Failed to get ledgers map index from: 53.log
>>>>> : No
>>>>> > ledgers map index found on entryLogId53
>>>>> > 18-03-08-09-33-10       Failed to get ledgers map index from: 54.log
>>>>> : No
>>>>> > ledgers map index found on entryLogId54
>>>>> > 18-03-08-09-33-10       Failed to get ledgers map index from:
>>>>> 375.log : No
>>>>> > ledgers map index found on entryLogId375
>>>>> > 18-03-08-09-33-13       Failed to get ledgers map index from:
>>>>> 878.log : No
>>>>> > ledgers map index found on entryLogId878
>>>>> > 18-03-08-09-33-22       Failed to get ledgers map index from:
>>>>> 1197.log : No
>>>>> > ledgers map index found on entryLogId1197
>>>>> > 18-03-08-09-33-57       Failed to get ledgers map index from:
>>>>> 1198.log : No
>>>>> > ledgers map index found on entryLogId1198
>>>>> > 18-03-08-09-34-54       Failed to get ledgers map index from:
>>>>> 1199.log : No
>>>>> > ledgers map index found on entryLogId1199
>>>>>
>>>>
>>>
>>> This is also unrelated. It is a fine logging, since it means it attempts
>>> to read ledgers map index, but didn't find one.
>>>
>>> the ledgers map index is only written when it is enabled at the bookies.
>>>
>>>
>>>> >
>>>>> > I will try to run Bookie with level debug, but it produces lots of
>>>>> spam
>>>>> >
>>>>> > Thanks
>>>>> > Enrico
>>>>> >
>>>>> >>
>>>>> >>
>>>>> >> -Ivan
>>>>> >>
>>>>> >> On Thu, Mar 8, 2018 at 9:18 AM, Enrico Olivelli <
>>>>> eolive...@gmail.com>
>>>>> >> wrote:
>>>>> >> > Hi all,
>>>>> >> > I am seeing this bad errors in some test environments with 4.6.1
>>>>> >> > The errors appear during rolling restart of the application, the
>>>>> test
>>>>> >> > env is
>>>>> >> > made of 6 machines:
>>>>> >> > - 3 bookies
>>>>> >> > - 3 client machines (with several BK clients, of different
>>>>> sub-systems)
>>>>> >> > - running with 4.6.1 both client and servers
>>>>> >> >
>>>>> >> > *I do not have* reports of this errors from production, already
>>>>> running
>>>>> >> > 4.6.1 for the last month
>>>>> >> >
>>>>> >> > But the problem is quite scary
>>>>> >> >
>>>>> >> > This is a sample of relevant errors on clients (in this case
>>>>> Majordodo
>>>>> >> > brokers, with log level = INFO)
>>>>> >> >
>>>>> >> > logs/org.apache.bookkeeper.client.PendingAddOp:
>>>>> >> > 18-03-07-09-51-43       Write of ledger entry to quorum failed:
>>>>> L366634
>>>>> >> > E2557
>>>>> >> > 18-03-07-09-51-43       Write of ledger entry to quorum failed:
>>>>> L366634
>>>>> >> > E2558
>>>>> >> > 18-03-07-09-51-43       Write of ledger entry to quorum failed:
>>>>> L366634
>>>>> >> > E2559
>>>>> >> > 18-03-07-15-59-55       Failed to write entry (366680, 1865):
>>>>> Bookie
>>>>> >> > operation timeout
>>>>> >> > 18-03-07-15-59-55       Failed to write entry (366680, 1865):
>>>>> Bookie
>>>>> >> > operation timeout
>>>>> >> > 18-03-07-16-00-00       Failed to write entry (366680, 1865):
>>>>> Bookie
>>>>> >> > operation timeout
>>>>> >> > 18-03-07-16-00-00       Failed to write entry (366680, 1865):
>>>>> Bookie
>>>>> >> > operation timeout
>>>>> >> > 18-03-07-16-00-05       Failed to write entry (366680, 1865):
>>>>> Bookie
>>>>> >> > operation timeout
>>>>> >> >
>>>>> >> >
>>>>> >> > org.apache.bookkeeper.proto.PerChannelBookieClient:
>>>>> >> >
>>>>> >> > 18-03-07-10-06-44       Unexpected exception caught by bookie
>>>>> client
>>>>> >> > channel
>>>>> >> > handler
>>>>> >> > 18-03-07-10-06-44       io.netty.handler.codec.DecoderException:
>>>>> >> > java.lang.IllegalStateException: Received unknown response : op
>>>>> code = 6
>>>>> >> > io.netty.handler.codec.DecoderException:
>>>>> >> > java.lang.IllegalStateException:
>>>>> >> > Received unknown response : op code = 6
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.MessageToMessageDecoder.channelRead(
>>>>> MessageToMessageDecoder.java:98)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:362)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:348)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
>>>>> AbstractChannelHandlerContext.java:340)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>>> ByteToMessageDecoder.java:310)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>>> ByteToMessageDecoder.java:297)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.ByteToMessageDecoder.callDecode(
>>>>> ByteToMessageDecoder.java:413)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.ByteToMessageDecoder.channelRead(
>>>>> ByteToMessageDecoder.java:265)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:362)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:348)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
>>>>> AbstractChannelHandlerContext.java:340)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(
>>>>> DefaultChannelPipeline.java:1414)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:362)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:348)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.DefaultChannelPipeline.fireChannelRead(
>>>>> DefaultChannelPipeline.java:945)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.epoll.AbstractEpollStreamChannel$
>>>>> EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.epoll.EpollEventLoop.processReady(
>>>>> EpollEventLoop.java:404)
>>>>> >> >         at
>>>>> >> > io.netty.channel.epoll.EpollEventLoop.run(
>>>>> EpollEventLoop.java:304)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.util.concurrent.SingleThreadEventExecutor$5.
>>>>> run(SingleThreadEventExecutor.java:886)
>>>>> >> >         at java.base/java.lang.Thread.run(Thread.java:844)
>>>>> >> > Caused by: java.lang.IllegalStateException: Received unknown
>>>>> response :
>>>>> >> > op
>>>>> >> > code = 6
>>>>> >> >         at
>>>>> >> >
>>>>> >> > org.apache.bookkeeper.proto.BookieProtoEncoding$
>>>>> ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > org.apache.bookkeeper.proto.BookieProtoEncoding$
>>>>> ResponseDecoder.decode(BookieProtoEncoding.java:478)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.MessageToMessageDecoder.channelRead(
>>>>> MessageToMessageDecoder.java:88)
>>>>> >> >
>>>>> >> >
>>>>> >> >         ... 19 more
>>>>> >> >
>>>>> >> >
>>>>> >> > Errors on Bookie side, with log level = INFO)
>>>>> >> >
>>>>> >> >
>>>>> >> > io.netty.channel.DefaultChannelPipeline:
>>>>> >> > 8-03-07-09-51-38       An exceptionCaught() event was fired, and
>>>>> it
>>>>> >> > reached
>>>>> >> > at the tail of the pipeline. It usually means the last handler in
>>>>> the
>>>>> >> > pipeline did not handle the exception.
>>>>> >> > 18-03-07-09-51-38       io.netty.util.
>>>>> IllegalReferenceCountException:
>>>>> >> > refCnt: 0, increment: 1
>>>>> >> > io.netty.util.IllegalReferenceCountException: refCnt: 0,
>>>>> increment: 1
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.buffer.AbstractReferenceCountedByteBuf.release0(
>>>>> AbstractReferenceCountedByteBuf.java:100)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.buffer.AbstractReferenceCountedByteBuf.release(
>>>>> AbstractReferenceCountedByteBuf.java:84)
>>>>> >> >         at
>>>>> >> > io.netty.util.ReferenceCountUtil.release(
>>>>> ReferenceCountUtil.java:88)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.DefaultChannelPipeline.
>>>>> onUnhandledInboundMessage(DefaultChannelPipeline.java:1202)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.DefaultChannelPipeline$TailContext.channelRead(
>>>>> DefaultChannelPipeline.java:1299)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:362)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:348)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
>>>>> AbstractChannelHandlerContext.java:340)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.ChannelInboundHandlerAdapter.channelRead(
>>>>> ChannelInboundHandlerAdapter.java:86)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:362)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:348)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
>>>>> AbstractChannelHandlerContext.java:340)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > org.apache.bookkeeper.proto.BookieRequestHandler.channelRead(
>>>>> BookieRequestHandler.java:78)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:362)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:348)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
>>>>> AbstractChannelHandlerContext.java:340)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.ChannelInboundHandlerAdapter.channelRead(
>>>>> ChannelInboundHandlerAdapter.java:86)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > org.apache.bookkeeper.proto.AuthHandler$ServerSideHandler.
>>>>> channelRead(AuthHandler.java:88)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:362)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:348)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
>>>>> AbstractChannelHandlerContext.java:340)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.MessageToMessageDecoder.channelRead(
>>>>> MessageToMessageDecoder.java:102)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:362)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:348)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
>>>>> AbstractChannelHandlerContext.java:340)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>>> ByteToMessageDecoder.java:310)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>>> ByteToMessageDecoder.java:297)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.ByteToMessageDecoder.callDecode(
>>>>> ByteToMessageDecoder.java:413)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.handler.codec.ByteToMessageDecoder.channelRead(
>>>>> ByteToMessageDecoder.java:265)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:362)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:348)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
>>>>> AbstractChannelHandlerContext.java:340)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(
>>>>> DefaultChannelPipeline.java:1414)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:362)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
>>>>> AbstractChannelHandlerContext.java:348)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.DefaultChannelPipeline.fireChannelRead(
>>>>> DefaultChannelPipeline.java:945)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.epoll.AbstractEpollStreamChannel$
>>>>> EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.channel.epoll.EpollEventLoop.processReady(
>>>>> EpollEventLoop.java:404)
>>>>> >> >         at
>>>>> >> > io.netty.channel.epoll.EpollEventLoop.run(
>>>>> EpollEventLoop.java:304)
>>>>> >> >         at
>>>>> >> >
>>>>> >> > io.netty.util.concurrent.SingleThreadEventExecutor$5.
>>>>> run(SingleThreadEventExecutor.java:886)
>>>>> >> >         at java.base/java.lang.Thread.run(Thread.java:844)
>>>>>
>>>>
>>> questions:
>>>
>>> 1) which protocol version are you using?
>>>
>>> 2) is the exception coming from read response or write response? it
>>> sound like write response. If it is write response, it would be surprised
>>> about the ref count. because we don't send any buffer back in write
>>> response.
>>>
>>>
>>>> >> >
>>>>> >> >
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>


-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Reply via email to