Re: qpid-cpp-0.35 errors

rammohan ganapavarapu Wed, 07 Nov 2018 12:26:43 -0800

Kim,

OK, I am still trying to work out which part of my Java application is
causing the issue. Yes, it happens intermittently. Regarding the
"JERR_WMGR_ENQDISCONT" error, could it be a chained exception from the
previous error, JERR_JCNTL_AIOCMPLWAIT?


Does message size contribute to this issue?

Thanks,
Ram

On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet <[email protected]>
wrote:

> No, they are not.
>
> These two defines govern the number of sleeps and the sleep time while
> waiting for AIO to complete before throwing an exception, and they apply
> during recovery only. They do not play a role during normal operation.
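
As a rough check of the "~1 sec" comment on the two defines, the worst-case sleep budget of the recovery loop is simply their product. The sketch below assumes only the values quoted in this thread; the helper name aio_wait_budget is made up for illustration and does not exist in the broker source.

```cpp
#include <chrono>
#include <cstdint>

// Values quoted in this thread from the reporter's build of the broker.
constexpr std::uint32_t MAX_AIO_SLEEPS = 100000;
constexpr std::uint32_t AIO_SLEEP_TIME_US = 10;

// Hypothetical helper (not part of qpid): total time requested from usleep()
// before the recovery loop gives up and throws.
constexpr std::chrono::microseconds aio_wait_budget(std::uint32_t max_sleeps,
                                                    std::uint32_t sleep_us) {
    return std::chrono::microseconds(
        static_cast<std::chrono::microseconds::rep>(max_sleeps) * sleep_us);
}
```

With the quoted values this gives 1,000,000 microseconds, i.e. one second. usleep() is allowed to sleep longer than requested, so the real wait is at least this long.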
>
> If you are able to compile the broker code, you can try playing with
> these values. But I don't think they will make much difference to the
> overall problem. I think some of the other errors you have been seeing
> prior to this one are closer to where the real problem lies - such as
> the JERR_WMGR_ENQDISCONT error.
>
> Do you have a reproducer of any kind? Does this error occur predictably
> under some or other conditions?
>
> Thanks,
>
> Kim van der Riet
>
> On 11/7/18 12:51 PM, rammohan ganapavarapu wrote:
> > Kim,
> >
> > I see these two settings in the code. Can they be made configurable?
> >
> > #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
> >
> > #define AIO_SLEEP_TIME_US  10 // 0.01 ms
> >
> >
> > Ram
> >
> > On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <
> > [email protected]> wrote:
> >
> >> Thank you, Kim. I will try your suggestions.
> >>
> >> On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet <[email protected]> wrote:
> >>
> >>> This error is a linearstore issue. It looks as though there is a single
> >>> write operation to disk that has become stuck, and is holding up all
> >>> further write operations. This happens because there is a fixed circular
> >>> pool of memory pages used for the AIO operations to disk, and when one
> >>> of these is "busy" (indicated by the letter A in the page state map),
> >>> write operations cannot continue until it is cleared. If it does not
> >>> clear within a certain time, then an exception is thrown, which usually
> >>> results in the broker closing the connection.
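
To illustrate the behaviour described above, here is a toy model of a fixed circular page pool in which one page with an outstanding write blocks all later writes once the ring comes back round to it. It is a sketch only, not the linearstore code: PagePool, PageState, submit_write() and complete() are hypothetical names, and the real write manager tracks more states than the two shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical two-state model; the real page state map has more states.
enum class PageState { Unused, AioPending };

class PagePool {
public:
    explicit PagePool(std::size_t num_pages)
        : pages_(num_pages, PageState::Unused), next_(0) {
        assert(num_pages > 0);
    }

    // Try to claim the next page in ring order for a write. Returns false
    // while that page still has an AIO operation outstanding.
    bool submit_write() {
        if (pages_[next_] == PageState::AioPending) return false;
        pages_[next_] = PageState::AioPending;
        next_ = (next_ + 1) % pages_.size();
        return true;
    }

    // Called when the disk reports that the write for `page` has completed.
    void complete(std::size_t page) { pages_.at(page) = PageState::Unused; }

    // Render the pool in the style of the ps=[...] field in the broker log.
    std::string state_map() const {
        std::string s;
        for (PageState p : pages_) s += (p == PageState::AioPending ? 'A' : '-');
        return s;
    }

private:
    std::vector<PageState> pages_;
    std::size_t next_;
};
```

In this model, if every page except one completes, submit_write() keeps failing until complete() is called for the remaining page, which mirrors a state map containing a single A.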
> >>>
> >>> The events leading up to a "stuck" write operation are complex and
> >>> sometimes difficult to reproduce. If you have a reproducer, then I would
> >>> be interested to see it! Even so, reproducing it on another machine is
> >>> hard, as it depends on such things as disk write speed, the disk
> >>> controller characteristics, the number of threads in the thread pool
> >>> (i.e. CPU type), memory and other hardware-related things.
> >>>
> >>> There are two linearstore parameters that you can try playing with to
> >>> see if you can change the behavior of the store:
> >>>
> >>> wcache-page-size: This sets the size of each page in the write buffer.
> >>> A larger page size is good for large messages; a smaller size will help
> >>> if you have small messages.
> >>>
> >>> wcache-num-pages: The total number of pages in the write buffer.
> >>>
> >>> Use --help on the broker with the linearstore loaded to see more
> >>> details on these. I hope that helps a little.
> >>>
> >>> Kim van der Riet
> >>>
> >>> On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:
> >>>> Any help in understanding why and when the broker throws those errors
> >>>> and stops receiving messages would be appreciated.
> >>>>
> >>>> I am not sure whether any kernel tuning or broker tuning needs to be
> >>>> done to solve this issue.
> >>>>
> >>>> Thanks in advance,
> >>>> Ram
> >>>>
> >>>> On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> Also, from this log message (store level), it seems to be waiting for
> >>>>> AIO to complete.
> >>>>>
> >>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal
> >>>>> name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT;
> >>>>> wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF
> >>>>> ps=[-------------------------A------]
> >>>>>
> >>>>> page_state ps=[-------------------------A------]  where A is AIO_PENDING
> >>>>> aer=1 _aio_evt_rem;          ///< Remaining AIO events
> >>>>> When there are pending AIO operations, does the broker close the
> >>>>> connection? Is there any tuning that can be done to resolve this?
> >>>>>
> >>>>> Thanks,
> >>>>> Ram
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> I was checking the code and I see these lines for that AIO timeout.
> >>>>>>
> >>>>>>                 case qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT:
> >>>>>>                   if (++aio_sleep_cnt > MAX_AIO_SLEEPS)
> >>>>>>                       THROW_STORE_EXCEPTION("Timeout waiting for AIO in MessageStoreImpl::recoverMessages()");
> >>>>>>                   ::usleep(AIO_SLEEP_TIME_US);
> >>>>>>                   break;
> >>>>>>
> >>>>>> And these are the defaults
> >>>>>>
> >>>>>> #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
> >>>>>>
> >>>>>> #define AIO_SLEEP_TIME_US  10 // 0.01 ms
> >>>>>>
> >>>>>>
> >>>>>>     RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next page is waiting for AIO.
> >>>>>>
> >>>>>> So did a page get blocked, and is it now waiting for page availability?
> >>>>>>
> >>>>>>
> >>>>>> Ram
> >>>>>>
> >>>>>> On Mon, Nov 5, 2018 at 8:00 PM rammohan ganapavarapu <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>> Actually, we have upgraded from qpid-cpp 0.28 to 1.35, and after that
> >>>>>>> we see this message:
> >>>>>>>
> >>>>>>> 2018-10-27 18:58:25 [Store] warning Linear Store: Journal
> >>>>>>> "<journal-name>": Bad record alignment found at fid=0x4605b offs=0x107680
> >>>>>>> (likely journal overwrite boundary); 19 filler record(s) required.
> >>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal
> >>>>>>> "<journal-name>": Recover phase write: Wrote filler record: fid=0x4605b
> >>>>>>> offs=0x107680
> >>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal
> >>>>>>> "<journal-name>": Recover phase write: Wr... few more Recover phase logs
> >>>>>>> It worked fine for a day and started throwing this message:
> >>>>>>>
> >>>>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<name>":
> >>>>>>> get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=25 pc=8
> >>>>>>> po=0 aer=1 edac=TFFF ps=[-------------------------A------]
> >>>>>>> 2018-10-28 12:27:01 [Broker] warning Exchange <name> cannot deliver to
> >>>>>>> queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store() failed:
> >>>>>>> jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT:
> >>>>>>> Timeout waiting for AIOs to complete.
> >>>>>>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
> >>>>>>> 2018-10-28 12:27:01 [Broker] error Connection exception: framing-error:
> >>>>>>> Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202
> >>>>>>> jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for
> >>>>>>> AIOs to complete.
> >>>>>>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
> >>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection
> >>>>>>> qpid.server-ip:5672-client-ip:44457 closed by error: Queue <queue-name>:
> >>>>>>> MessageStoreImpl::store() failed: jexception 0x0202
> >>>>>>> jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for
> >>>>>>> AIOs to complete.
> >>>>>>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)(501)
> >>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection
> >>>>>>> qpid.server-ip:5672-client-ip:44457 closed by error: illegal-argument:
> >>>>>>> Value for replyText is too large(320)
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Ram
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Nov 5, 2018 at 3:34 PM rammohan ganapavarapu <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> No, local disk.
> >>>>>>>>
> >>>>>>>> On Mon, Nov 5, 2018 at 3:26 PM Gordon Sim <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> On 05/11/18 22:58, rammohan ganapavarapu wrote:
> >>>>>>>>>> Gordon,
> >>>>>>>>>>
> >>>>>>>>>> We are using Java client version 0.28 and qpid-cpp version 1.35
> >>>>>>>>>> (qpid-cpp-server-1.35.0-1.el7.x86_64). I don't know in what scenario
> >>>>>>>>>> it happens, but after I restart the broker it happens again within a
> >>>>>>>>>> few days. From the above logs, do you have any pointers on what to
> >>>>>>>>>> check?
> >>>>>>>>>
> >>>>>>>>> Are you using NFS?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>>>> For additional commands, e-mail: [email protected]
> >>>>>>>>>
