Kim/Gordon,

I was wrong about the NFS for qpid journal files, looks like they are on
NFS, so does NFS cause this issue?

Ram

On Wed, Nov 7, 2018 at 12:18 PM rammohan ganapavarapu <
[email protected]> wrote:

> Kim,
>
> Ok, i am still trying to see what part of my java application is causing
> that issue, yes that issue is happening intermittently. Regarding
> "JERR_WMGR_ENQDISCONT" error, may be they are chained exceptions from the
> previous error JERR_JCNTL_AIOCMPLWAIT?
>
> Does message size contribute to this issue?
>
> Thanks,
> Ram
>
> On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet <[email protected]>
> wrote:
>
>> No, they are not.
>>
>> These two defines govern the number of sleeps and the sleep time while
>> waiting for before throwing an exception during recovery only. They do
>> not play a role during normal operation.
>>
>> If you are able to compile the broker code, you can try playing with
>> these values. But I don't think they will make much difference to the
>> overall problem. I think some of the other errors you have been seeing
>> prior to this one are closer to where the real problem lies - such as
>> the JRNL_WMGR_ENQDISCONT error.
>>
>> Do you have a reproducer of any kind? Does this error occur predictably
>> under some or other conditions?
>>
>> Thanks,
>>
>> Kim van der Riet
>>
>> On 11/7/18 12:51 PM, rammohan ganapavarapu wrote:
>> > Kim,
>> >
>> > I see these two settings from code, can these be configurable?
>> >
>> > #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
>> >
>> > #define AIO_SLEEP_TIME_US  10 // 0.01 ms
>> >
>> >
>> > Ram
>> >
>> > On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <
>> > [email protected]> wrote:
>> >
>> >> Thank you Kim, i will try your suggestions.
>> >>
>> >> On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet <[email protected]
>> wrote:
>> >>
>> >>> This error is a linearstore issue. It looks as though there is a
>> single
>> >>> write operation to disk that has become stuck, and is holding up all
>> >>> further write operations. This happens because there is a fixed
>> circular
>> >>> pool of memory pages used for the AIO operations to disk, and when one
>> >>> of these is "busy" (indicated by the A letter in the  page state map),
>> >>> write operations cannot continue until it is cleared. It it does not
>> >>> clear within a certain time, then an exception is thrown, which
>> usually
>> >>> results in the broker closing the connection.
>> >>>
>> >>> The events leading up to a "stuck" write operation are complex and
>> >>> sometimes difficult to reproduce. If you have a reproducer, then I
>> would
>> >>> be interested to see it! Even so, the ability to reproduce on another
>> >>> machine is hard as it depends on such things as disk write speed, the
>> >>> disk controller characteristics, the number of threads in the thread
>> >>> pool (ie CPU type), memory and other hardware-related things.
>> >>>
>> >>> There are two linearstore parameters that you can try playing with to
>> >>> see if you can change the behavior of the store:
>> >>>
>> >>> wcache-page-size: This sets the size of each page in the write buffer.
>> >>> Larger page size is good for large messages, a smaller size will help
>> if
>> >>> you have small messages.
>> >>>
>> >>> wchache-num-pages: The total number of pages in the write buffer.
>> >>>
>> >>> Use the --help on the broker with the linearstore loaded to see more
>> >>> details on this. I hope that helps a little.
>> >>>
>> >>> Kim van der Riet
>> >>>
>> >>> On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:
>> >>>> Any help in understand why/when broker throws those errors and stop
>> >>>> receiving message would be appreciated.
>> >>>>
>> >>>> Not sure if any kernel tuning or broker tuning needs to be done to
>> >>>> solve this issue.
>> >>>>
>> >>>> Thanks in advance,
>> >>>> Ram
>> >>>>
>> >>>> On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <
>> >>>> [email protected]> wrote:
>> >>>>
>> >>>>> Also from this log message (store level) it seems like waiting for
>> AIO
>> >>> to
>> >>>>> complete.
>> >>>>>
>> >>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal
>> >>>>> name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT;
>> >>>>> wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF
>> >>>>> ps=[-------------------------A------]
>> >>>>>
>> >>>>> page_state ps=[-------------------------A------]  where A is
>> >>> AIO_PENDING
>> >>>>> aer=1 _aio_evt_rem;          ///< Remaining AIO events
>> >>>>>
>> >>>>> When there is or there are pending AIO, does broker close the
>> >>> connection?
>> >>>>> is there any tuning that can be done to resolve this?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Ram
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu <
>> >>>>> [email protected]> wrote:
>> >>>>>
>> >>>>>> I was check the code and i see these lines for that AIO timeout.
>> >>>>>>
>> >>>>>>                 case
>> >>> qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT:
>> >>>>>>                   if (++aio_sleep_cnt > MAX_AIO_SLEEPS)
>> >>>>>>                       THROW_STORE_EXCEPTION("Timeout waiting for
>> AIO in
>> >>>>>> MessageStoreImpl::recoverMessages()");
>> >>>>>>                   ::usleep(AIO_SLEEP_TIME_US);
>> >>>>>>                   break;
>> >>>>>>
>> >>>>>> And these are the defaults
>> >>>>>>
>> >>>>>> #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
>> >>>>>>
>> >>>>>> #define AIO_SLEEP_TIME_US  10 // 0.01 ms
>> >>>>>>
>> >>>>>>
>> >>>>>>     RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next
>> page is
>> >>>>>> waiting for AIO.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> So does page got blocked and its waiting for page availability?
>> >>>>>>
>> >>>>>>
>> >>>>>> Ram
>> >>>>>>
>> >>>>>> On Mon, Nov 5, 2018 at 8:00 PM rammohan ganapavarapu <
>> >>>>>> [email protected]> wrote:
>> >>>>>>
>> >>>>>>> Actually we have upgraded from qpid-cpp 0.28 to 1.35 and after
>> that
>> >>> we
>> >>>>>>> see this message
>> >>>>>>>
>> >>>>>>> 2018-10-27 18:58:25 [Store] warning Linear Store: Journal
>> >>>>>>> "<journal-name>": Bad record alignment found at fid=0x4605b
>> >>> offs=0x107680
>> >>>>>>> (likely journal overwrite boundary); 19 filler record(s) required.
>> >>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal
>> >>>>>>> "<journal-name>": Recover phase write: Wrote filler record:
>> >>> fid=0x4605b
>> >>>>>>> offs=0x107680
>> >>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal
>> >>>>>>> "<journal-name>": Recover phase write: Wr... few more Recover
>> phase
>> >>> logs
>> >>>>>>> It worked fine for a day and started throwing this message:
>> >>>>>>>
>> >>>>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal
>> "<name>":
>> >>>>>>> get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr:
>> >>> pi=25 pc=8
>> >>>>>>> po=0 aer=1 edac=TFFF ps=[-------------------------A------]
>> >>>>>>> 2018-10-28 12:27:01 [Broker] warning Exchange <name> cannot
>> deliver
>> >>> to
>> >>>>>>> queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store()
>> >>> failed:
>> >>>>>>> jexception 0x0202 jcntl::handle_aio_wait() threw
>> >>> JERR_JCNTL_AIOCMPLWAIT:
>> >>>>>>> Timeout waiting for AIOs to complete.
>> >>>>>>>
>> >>>
>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
>> >>>>>>> 2018-10-28 12:27:01 [Broker] error Connection exception:
>> >>> framing-error:
>> >>>>>>> Queue <queue-name>: MessageStoreImpl::store() failed: jexception
>> >>> 0x0202
>> >>>>>>> jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout
>> >>> waiting for
>> >>>>>>> AIOs to complete.
>> >>>>>>>
>> >>>
>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
>> >>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection
>> >>>>>>> qpid.server-ip:5672-client-ip:44457 closed by error: Queue
>> >>> <queue-name>:
>> >>>>>>> MessageStoreImpl::store() failed: jexception 0x0202
>> >>>>>>> jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout
>> >>> waiting for
>> >>>>>>> AIOs to complete.
>> >>>>>>>
>> >>>
>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)(501)
>> >>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection
>> >>>>>>> qpid.server-ip:5672-client-ip:44457 closed by error:
>> >>> illegal-argument:
>> >>>>>>> Value for replyText is too large(320)
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Ram
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Mon, Nov 5, 2018 at 3:34 PM rammohan ganapavarapu <
>> >>>>>>> [email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> No, local disk.
>> >>>>>>>>
>> >>>>>>>> On Mon, Nov 5, 2018 at 3:26 PM Gordon Sim <[email protected]>
>> wrote:
>> >>>>>>>>
>> >>>>>>>>> On 05/11/18 22:58, rammohan ganapavarapu wrote:
>> >>>>>>>>>> Gordon,
>> >>>>>>>>>>
>> >>>>>>>>>> We are using java client 0.28 version and qpidd-cpp 1.35
>> version
>> >>>>>>>>>> (qpid-cpp-server-1.35.0-1.el7.x86_64), i dont know at what
>> >>> scenario
>> >>>>>>>>> its
>> >>>>>>>>>> happening but after i restart broker and if we wait for few
>> days
>> >>> its
>> >>>>>>>>>> happening again. From the above logs do you have any pointers
>> to
>> >>>>>>>>> check?
>> >>>>>>>>>
>> >>>>>>>>> Are you using NFS?
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>> ---------------------------------------------------------------------
>> >>>>>>>>> To unsubscribe, e-mail: [email protected]
>> >>>>>>>>> For additional commands, e-mail: [email protected]
>> >>>>>>>>>
>> >>>>>>>>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: [email protected]
>> >>> For additional commands, e-mail: [email protected]
>> >>>
>> >>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>

Reply via email to