Kim/Gordon, I was wrong about the NFS for qpid journal files, looks like they are on NFS, so does NFS cause this issue?
Ram On Wed, Nov 7, 2018 at 12:18 PM rammohan ganapavarapu < [email protected]> wrote: > Kim, > > Ok, i am still trying to see what part of my java application is causing > that issue, yes that issue is happening intermittently. Regarding > "JERR_WMGR_ENQDISCONT" error, may be they are chained exceptions from the > previous error JERR_JCNTL_AIOCMPLWAIT? > > Does message size contribute to this issue? > > Thanks, > Ram > > On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet <[email protected]> > wrote: > >> No, they are not. >> >> These two defines govern the number of sleeps and the sleep time while >> waiting for before throwing an exception during recovery only. They do >> not play a role during normal operation. >> >> If you are able to compile the broker code, you can try playing with >> these values. But I don't think they will make much difference to the >> overall problem. I think some of the other errors you have been seeing >> prior to this one are closer to where the real problem lies - such as >> the JRNL_WMGR_ENQDISCONT error. >> >> Do you have a reproducer of any kind? Does this error occur predictably >> under some or other conditions? >> >> Thanks, >> >> Kim van der Riet >> >> On 11/7/18 12:51 PM, rammohan ganapavarapu wrote: >> > Kim, >> > >> > I see these two settings from code, can these be configurable? >> > >> > #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec >> > >> > #define AIO_SLEEP_TIME_US 10 // 0.01 ms >> > >> > >> > Ram >> > >> > On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu < >> > [email protected]> wrote: >> > >> >> Thank you Kim, i will try your suggestions. >> >> >> >> On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet <[email protected] >> wrote: >> >> >> >>> This error is a linearstore issue. It looks as though there is a >> single >> >>> write operation to disk that has become stuck, and is holding up all >> >>> further write operations. This happens because there is a fixed >> circular >> >>> pool of memory pages used for the AIO operations to disk, and when one >> >>> of these is "busy" (indicated by the A letter in the page state map), >> >>> write operations cannot continue until it is cleared. It it does not >> >>> clear within a certain time, then an exception is thrown, which >> usually >> >>> results in the broker closing the connection. >> >>> >> >>> The events leading up to a "stuck" write operation are complex and >> >>> sometimes difficult to reproduce. If you have a reproducer, then I >> would >> >>> be interested to see it! Even so, the ability to reproduce on another >> >>> machine is hard as it depends on such things as disk write speed, the >> >>> disk controller characteristics, the number of threads in the thread >> >>> pool (ie CPU type), memory and other hardware-related things. >> >>> >> >>> There are two linearstore parameters that you can try playing with to >> >>> see if you can change the behavior of the store: >> >>> >> >>> wcache-page-size: This sets the size of each page in the write buffer. >> >>> Larger page size is good for large messages, a smaller size will help >> if >> >>> you have small messages. >> >>> >> >>> wchache-num-pages: The total number of pages in the write buffer. >> >>> >> >>> Use the --help on the broker with the linearstore loaded to see more >> >>> details on this. I hope that helps a little. >> >>> >> >>> Kim van der Riet >> >>> >> >>> On 11/6/18 2:12 PM, rammohan ganapavarapu wrote: >> >>>> Any help in understand why/when broker throws those errors and stop >> >>>> receiving message would be appreciated. >> >>>> >> >>>> Not sure if any kernel tuning or broker tuning needs to be done to >> >>>> solve this issue. >> >>>> >> >>>> Thanks in advance, >> >>>> Ram >> >>>> >> >>>> On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu < >> >>>> [email protected]> wrote: >> >>>> >> >>>>> Also from this log message (store level) it seems like waiting for >> AIO >> >>> to >> >>>>> complete. >> >>>>> >> >>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal >> >>>>> name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT; >> >>>>> wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF >> >>>>> ps=[-------------------------A------] >> >>>>> >> >>>>> page_state ps=[-------------------------A------] where A is >> >>> AIO_PENDING >> >>>>> aer=1 _aio_evt_rem; ///< Remaining AIO events >> >>>>> >> >>>>> When there is or there are pending AIO, does broker close the >> >>> connection? >> >>>>> is there any tuning that can be done to resolve this? >> >>>>> >> >>>>> Thanks, >> >>>>> Ram >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu < >> >>>>> [email protected]> wrote: >> >>>>> >> >>>>>> I was check the code and i see these lines for that AIO timeout. >> >>>>>> >> >>>>>> case >> >>> qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT: >> >>>>>> if (++aio_sleep_cnt > MAX_AIO_SLEEPS) >> >>>>>> THROW_STORE_EXCEPTION("Timeout waiting for >> AIO in >> >>>>>> MessageStoreImpl::recoverMessages()"); >> >>>>>> ::usleep(AIO_SLEEP_TIME_US); >> >>>>>> break; >> >>>>>> >> >>>>>> And these are the defaults >> >>>>>> >> >>>>>> #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec >> >>>>>> >> >>>>>> #define AIO_SLEEP_TIME_US 10 // 0.01 ms >> >>>>>> >> >>>>>> >> >>>>>> RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next >> page is >> >>>>>> waiting for AIO. >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> So does page got blocked and its waiting for page availability? >> >>>>>> >> >>>>>> >> >>>>>> Ram >> >>>>>> >> >>>>>> On Mon, Nov 5, 2018 at 8:00 PM rammohan ganapavarapu < >> >>>>>> [email protected]> wrote: >> >>>>>> >> >>>>>>> Actually we have upgraded from qpid-cpp 0.28 to 1.35 and after >> that >> >>> we >> >>>>>>> see this message >> >>>>>>> >> >>>>>>> 2018-10-27 18:58:25 [Store] warning Linear Store: Journal >> >>>>>>> "<journal-name>": Bad record alignment found at fid=0x4605b >> >>> offs=0x107680 >> >>>>>>> (likely journal overwrite boundary); 19 filler record(s) required. >> >>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal >> >>>>>>> "<journal-name>": Recover phase write: Wrote filler record: >> >>> fid=0x4605b >> >>>>>>> offs=0x107680 >> >>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal >> >>>>>>> "<journal-name>": Recover phase write: Wr... few more Recover >> phase >> >>> logs >> >>>>>>> It worked fine for a day and started throwing this message: >> >>>>>>> >> >>>>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal >> "<name>": >> >>>>>>> get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: >> >>> pi=25 pc=8 >> >>>>>>> po=0 aer=1 edac=TFFF ps=[-------------------------A------] >> >>>>>>> 2018-10-28 12:27:01 [Broker] warning Exchange <name> cannot >> deliver >> >>> to >> >>>>>>> queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store() >> >>> failed: >> >>>>>>> jexception 0x0202 jcntl::handle_aio_wait() threw >> >>> JERR_JCNTL_AIOCMPLWAIT: >> >>>>>>> Timeout waiting for AIOs to complete. >> >>>>>>> >> >>> >> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211) >> >>>>>>> 2018-10-28 12:27:01 [Broker] error Connection exception: >> >>> framing-error: >> >>>>>>> Queue <queue-name>: MessageStoreImpl::store() failed: jexception >> >>> 0x0202 >> >>>>>>> jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout >> >>> waiting for >> >>>>>>> AIOs to complete. >> >>>>>>> >> >>> >> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211) >> >>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection >> >>>>>>> qpid.server-ip:5672-client-ip:44457 closed by error: Queue >> >>> <queue-name>: >> >>>>>>> MessageStoreImpl::store() failed: jexception 0x0202 >> >>>>>>> jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout >> >>> waiting for >> >>>>>>> AIOs to complete. >> >>>>>>> >> >>> >> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)(501) >> >>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection >> >>>>>>> qpid.server-ip:5672-client-ip:44457 closed by error: >> >>> illegal-argument: >> >>>>>>> Value for replyText is too large(320) >> >>>>>>> >> >>>>>>> Thanks, >> >>>>>>> Ram >> >>>>>>> >> >>>>>>> >> >>>>>>> On Mon, Nov 5, 2018 at 3:34 PM rammohan ganapavarapu < >> >>>>>>> [email protected]> wrote: >> >>>>>>> >> >>>>>>>> No, local disk. >> >>>>>>>> >> >>>>>>>> On Mon, Nov 5, 2018 at 3:26 PM Gordon Sim <[email protected]> >> wrote: >> >>>>>>>> >> >>>>>>>>> On 05/11/18 22:58, rammohan ganapavarapu wrote: >> >>>>>>>>>> Gordon, >> >>>>>>>>>> >> >>>>>>>>>> We are using java client 0.28 version and qpidd-cpp 1.35 >> version >> >>>>>>>>>> (qpid-cpp-server-1.35.0-1.el7.x86_64), i dont know at what >> >>> scenario >> >>>>>>>>> its >> >>>>>>>>>> happening but after i restart broker and if we wait for few >> days >> >>> its >> >>>>>>>>>> happening again. From the above logs do you have any pointers >> to >> >>>>>>>>> check? >> >>>>>>>>> >> >>>>>>>>> Are you using NFS? >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>> --------------------------------------------------------------------- >> >>>>>>>>> To unsubscribe, e-mail: [email protected] >> >>>>>>>>> For additional commands, e-mail: [email protected] >> >>>>>>>>> >> >>>>>>>>> >> >>> --------------------------------------------------------------------- >> >>> To unsubscribe, e-mail: [email protected] >> >>> For additional commands, e-mail: [email protected] >> >>> >> >>> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> >>
