Kim,

OK, I am still trying to work out which part of my Java application is causing the issue. Yes, it is happening intermittently. Regarding the "JERR_WMGR_ENQDISCONT" error, could it be a chained exception from the earlier JERR_JCNTL_AIOCMPLWAIT error?
Does message size contribute to this issue?

Thanks,
Ram

On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet <[email protected]> wrote:

> No, they are not.
>
> These two defines govern the number of sleeps and the sleep time while
> waiting before throwing an exception during recovery only. They do
> not play a role during normal operation.
>
> If you are able to compile the broker code, you can try playing with
> these values. But I don't think they will make much difference to the
> overall problem. I think some of the other errors you have been seeing
> prior to this one are closer to where the real problem lies - such as
> the JERR_WMGR_ENQDISCONT error.
>
> Do you have a reproducer of any kind? Does this error occur predictably
> under some or other conditions?
>
> Thanks,
>
> Kim van der Riet
>
> On 11/7/18 12:51 PM, rammohan ganapavarapu wrote:
> > Kim,
> >
> > I see these two settings in the code. Can they be made configurable?
> >
> > #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
> >
> > #define AIO_SLEEP_TIME_US 10 // 0.01 ms
> >
> > Ram
> >
> > On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <[email protected]> wrote:
> >
> >> Thank you Kim, I will try your suggestions.
> >>
> >> On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet <[email protected]> wrote:
> >>
> >>> This error is a linearstore issue. It looks as though there is a single
> >>> write operation to disk that has become stuck, and is holding up all
> >>> further write operations. This happens because there is a fixed circular
> >>> pool of memory pages used for the AIO operations to disk, and when one
> >>> of these is "busy" (indicated by the A letter in the page state map),
> >>> write operations cannot continue until it is cleared. If it does not
> >>> clear within a certain time, then an exception is thrown, which usually
> >>> results in the broker closing the connection.
> >>>
> >>> The events leading up to a "stuck" write operation are complex and
> >>> sometimes difficult to reproduce. If you have a reproducer, then I would
> >>> be interested to see it! Even so, the ability to reproduce on another
> >>> machine is hard as it depends on such things as disk write speed, the
> >>> disk controller characteristics, the number of threads in the thread
> >>> pool (i.e. CPU type), memory and other hardware-related things.
> >>>
> >>> There are two linearstore parameters that you can try playing with to
> >>> see if you can change the behavior of the store:
> >>>
> >>> wcache-page-size: This sets the size of each page in the write buffer.
> >>> Larger page size is good for large messages, a smaller size will help if
> >>> you have small messages.
> >>>
> >>> wcache-num-pages: The total number of pages in the write buffer.
> >>>
> >>> Use the --help on the broker with the linearstore loaded to see more
> >>> details on this. I hope that helps a little.
> >>>
> >>> Kim van der Riet
> >>>
> >>> On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:
> >>>> Any help in understanding why/when the broker throws those errors and
> >>>> stops receiving messages would be appreciated.
> >>>>
> >>>> Not sure if any kernel tuning or broker tuning needs to be done to
> >>>> solve this issue.
> >>>>
> >>>> Thanks in advance,
> >>>> Ram
> >>>>
> >>>> On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <[email protected]> wrote:
> >>>>
> >>>>> Also, from this log message (store level) it seems to be waiting for
> >>>>> AIO to complete.
> >>>>>
> >>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF ps=[-------------------------A------]
> >>>>>
> >>>>> page_state ps=[-------------------------A------] where A is AIO_PENDING
> >>>>> aer=1 _aio_evt_rem; ///< Remaining AIO events
> >>>>>
> >>>>> When there are pending AIOs, does the broker close the connection?
> >>>>> Is there any tuning that can be done to resolve this?
> >>>>>
> >>>>> Thanks,
> >>>>> Ram
> >>>>>
> >>>>> On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu <[email protected]> wrote:
> >>>>>
> >>>>>> I was checking the code and I see these lines for that AIO timeout.
> >>>>>>
> >>>>>> case qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT:
> >>>>>>     if (++aio_sleep_cnt > MAX_AIO_SLEEPS)
> >>>>>>         THROW_STORE_EXCEPTION("Timeout waiting for AIO in MessageStoreImpl::recoverMessages()");
> >>>>>>     ::usleep(AIO_SLEEP_TIME_US);
> >>>>>>     break;
> >>>>>>
> >>>>>> And these are the defaults
> >>>>>>
> >>>>>> #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
> >>>>>>
> >>>>>> #define AIO_SLEEP_TIME_US 10 // 0.01 ms
> >>>>>>
> >>>>>> RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next page is waiting for AIO.
> >>>>>>
> >>>>>> So did the page get blocked, and is it waiting for page availability?
> >>>>>>
> >>>>>> Ram
> >>>>>>
> >>>>>> On Mon, Nov 5, 2018 at 8:00 PM rammohan ganapavarapu <[email protected]> wrote:
> >>>>>>
> >>>>>>> Actually we have upgraded from qpid-cpp 0.28 to 1.35 and after that we
> >>>>>>> see this message
> >>>>>>>
> >>>>>>> 2018-10-27 18:58:25 [Store] warning Linear Store: Journal "<journal-name>": Bad record alignment found at fid=0x4605b offs=0x107680 (likely journal overwrite boundary); 19 filler record(s) required.
> >>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal "<journal-name>": Recover phase write: Wrote filler record: fid=0x4605b offs=0x107680
> >>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal "<journal-name>": Recover phase write: Wr... few more Recover phase logs
> >>>>>>>
> >>>>>>> It worked fine for a day and started throwing this message:
> >>>>>>>
> >>>>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF ps=[-------------------------A------]
> >>>>>>> 2018-10-28 12:27:01 [Broker] warning Exchange <name> cannot deliver to queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
> >>>>>>> 2018-10-28 12:27:01 [Broker] error Connection exception: framing-error: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
> >>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection qpid.server-ip:5672-client-ip:44457 closed by error: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete.
> >>>>>>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)(501)
> >>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection qpid.server-ip:5672-client-ip:44457 closed by error: illegal-argument: Value for replyText is too large(320)
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Ram
> >>>>>>>
> >>>>>>> On Mon, Nov 5, 2018 at 3:34 PM rammohan ganapavarapu <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> No, local disk.
> >>>>>>>>
> >>>>>>>> On Mon, Nov 5, 2018 at 3:26 PM Gordon Sim <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> On 05/11/18 22:58, rammohan ganapavarapu wrote:
> >>>>>>>>>> Gordon,
> >>>>>>>>>>
> >>>>>>>>>> We are using Java client version 0.28 and qpid-cpp version 1.35
> >>>>>>>>>> (qpid-cpp-server-1.35.0-1.el7.x86_64). I don't know in what scenario
> >>>>>>>>>> it's happening, but after I restart the broker and we wait a few days,
> >>>>>>>>>> it happens again. From the above logs, do you have any pointers on
> >>>>>>>>>> what to check?
> >>>>>>>>>
> >>>>>>>>> Are you using NFS?
> >>>>>>>>>
> >>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>>>> For additional commands, e-mail: [email protected]
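
For reference, here is a minimal C++ sketch of how a bounded sleep-and-poll loop of the shape quoted above behaves, and where the "~1 sec" figure comes from. It is not the broker's code: the helper names `min_total_wait_us` and `wait_for_page` and the `page_is_free` callback are hypothetical, introduced only for illustration. Only the two constant values and the increment-then-compare pattern are taken from the snippet in the thread.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <thread>

// Values as quoted in the thread (defined there with #define).
constexpr std::uint32_t MAX_AIO_SLEEPS = 100000;  // maximum number of sleeps
constexpr std::uint32_t AIO_SLEEP_TIME_US = 10;   // microseconds per sleep

// Hypothetical helper: the minimum time spent sleeping before the timeout
// fires, in microseconds. Each sleep lasts at least the requested time,
// so this is a lower bound on the real wait.
constexpr std::uint64_t min_total_wait_us(std::uint32_t max_sleeps,
                                          std::uint32_t sleep_us) {
    return static_cast<std::uint64_t>(max_sleeps) * sleep_us;
}

// 100000 sleeps x 10 us = 1,000,000 us, i.e. the "~1 sec" in the comment.
static_assert(min_total_wait_us(MAX_AIO_SLEEPS, AIO_SLEEP_TIME_US) == 1000000,
              "quoted defaults give a one-second lower bound");

// Hypothetical helper: poll page_is_free() until it returns true, sleeping
// between polls. Uses the same increment-then-compare check as the quoted
// code, so it sleeps exactly max_sleeps times before throwing.
// Returns the number of sleeps that were needed.
inline std::uint32_t wait_for_page(const std::function<bool()>& page_is_free,
                                   std::uint32_t max_sleeps,
                                   std::uint32_t sleep_us) {
    std::uint32_t sleep_cnt = 0;
    while (!page_is_free()) {
        if (++sleep_cnt > max_sleeps)
            throw std::runtime_error("Timeout waiting for AIO");
        std::this_thread::sleep_for(std::chrono::microseconds(sleep_us));
    }
    return sleep_cnt;
}
```

With the quoted values, a page that never clears causes the loop to give up after at least one second of sleeping. Raising either value only lengthens that wait; it does not unstick a page that is never released, which fits Kim's remark that changing the defines is unlikely to make much difference to the underlying problem.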
