Thank you, yes we have multiple producers and multiple consumers accessing the same queue at the same time. My message size is around ~80k. One small point of confusion regarding AIO: if it is AIO, it shouldn't have any I/O wait, right? Why is it waiting?
Was reading about AIO and found this; am I hitting this scenario? "AIO read and write on files opened without O_DIRECT (i.e. normal buffered filesystem AIO). On ext2, ext3, jfs, xfs and nfs, these do not return an explicit error, but quietly default to sync" (from http://lse.sourceforge.net/io/aio.html).

We have enabled trace logging and here is the full exception stack; see if it helps you identify the issue:

2018-11-06 18:04:33 [Broker] debug clean(): 10 messages remain; head is now 0
2018-11-06 18:04:38 [Store] critical Linear Store: Journal "ax-q-axgroup-001-consumer-group-001": get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=31 pc=2 po=0 aer=1 edac=TFFF ps=[-------------------------------A]
2018-11-06 18:04:38 [System] debug Exception constructed: Queue <queue-name>: async_dequeue() failed: jexception 0x0000 (RHM_IORES_BUSY: dequeue while part way through another op: _enq_busy=T _abort_busy=F _commit_busy=F) (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1272)
2018-11-06 18:04:38 [System] debug Exception constructed: Queue <queue-name>: async_dequeue() failed: jexception 0x0000 (RHM_IORES_BUSY: dequeue while part way through another op: _enq_busy=T _abort_busy=F _commit_busy=F) (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1272)
2018-11-06 18:04:38 [System] debug Exception constructed: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
2018-11-06 18:04:39 [System] debug Exception constructed: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
2018-11-06 18:04:39 [Broker] warning Exchange <exchange-name> cannot deliver to Queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
2018-11-06 18:04:39 [System] debug Exception constructed: Queue <queue-name>: async_dequeue() failed: jexception 0x0000 (RHM_IORES_BUSY: dequeue while part way through another op: _enq_busy=T _abort_busy=F _commit_busy=F) (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1272)
2018-11-06 18:04:39 [System] debug Exception constructed: Queue <queue-name>: async_dequeue() failed: jexception 0x0000 (RHM_IORES_BUSY: dequeue while part way through another op: _enq_busy=T _abort_busy=F _commit_busy=F) (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1272)
2018-11-06 18:04:39 [Broker] error Connection exception: framing-error: Queue <queue-name>: async_dequeue() failed: jexception 0x0000 (RHM_IORES_BUSY: dequeue while part way through another op: _enq_busy=T _abort_busy=F _commit_busy=F) (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1272)
2018-11-06 18:04:39 [Broker] error Connection exception: framing-error: Queue <queue-name>: async_dequeue() failed: jexception 0x0000 (RHM_IORES_BUSY: dequeue while part way through another op: _enq_busy=T _abort_busy=F _commit_busy=F) (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1272)
2018-11-06 18:04:39 [Protocol] error Connection qpid.<server>:5672-<server>:33484 closed by error: Queue <queue-name>: async_dequeue() failed: jexception 0x0000 (RHM_IORES_BUSY: dequeue while part way through another op: _enq_busy=T _abort_busy=F _commit_busy=F) (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1272)(501)
2018-11-06 18:04:39 [Protocol] error Connection qpid.<server>:5672-<client>:40333 closed by error: Queue <queue-name>: async_dequeue() failed: jexception 0x0000 (RHM_IORES_BUSY: dequeue while part way through another op: _enq_busy=T _abort_busy=F _commit_busy=F) (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1272)(501)
2018-11-06 18:04:39 [System] debug Exception constructed: Value for replyText is too large
2018-11-06 18:04:39 [Protocol] error Connection qpid.<server>:5672-<client>:40333 closed by error: illegal-argument: Value for replyText is too large(320)

Thanks a lot for your help,
Ram

On Thu, Nov 8, 2018 at 11:26 AM Kim van der Riet <[email protected]> wrote:

> Resending, did not show up on the list the first time I sent it...
>
> -------- Forwarded Message --------
> Subject: Re: qpid-cpp-0.35 errors
> Date: Thu, 8 Nov 2018 09:30:24 -0500
> From: Kim van der Riet <[email protected]>
> To: [email protected]
>
> On 11/7/18 3:18 PM, rammohan ganapavarapu wrote:
> > Kim,
> >
> > Ok, i am still trying to see what part of my java application is causing
> > that issue; yes, that issue is happening intermittently. Regarding the
> > "JERR_WMGR_ENQDISCONT" error, maybe they are chained exceptions from the
> > previous error, JERR_JCNTL_AIOCMPLWAIT?
> In my mind, it is more likely the other way around. But the logs should
> tell you that. It would be best to start with a clean store before each
> test so you don't inherit issues from a previous test or run.
> > Does message size contribute to this issue?
> Yes, but only in the sense that the size alters the packing of the write
> buffers and the timing of when they are written. Also, the number of
> simultaneous producers and consumers will affect this.
> In particular, two producers simultaneously sending messages to the same
> queue, or a consumer consuming from a queue while a producer is also
> sending, are going to be the main factors in any race condition such as
> I suspect this is. Playing with those will give clues as to what is
> happening. You could try the following, each time starting with a clean
> store:
>
> 1. Only allowing a single producer, followed by a single consumer (ie
> not at the same time);
>
> 2. Allowing a single producer and a single consumer to operate
> simultaneously;
>
> 3. Allowing multiple producers only (I don't know if your use-case has
> this);
>
> 4. Allowing multiple consumers.
>
> Once you have isolated which scenarios cause the problem, then try
> varying the message size. The answers to these will help isolate where
> the issue is happening.
>
> > Thanks,
> > Ram
> >
> > On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet <[email protected]>
> > wrote:
> >
> >> No, they are not.
> >>
> >> These two defines govern the number of sleeps and the sleep time while
> >> waiting before throwing an exception during recovery only. They do
> >> not play a role during normal operation.
> >>
> >> If you are able to compile the broker code, you can try playing with
> >> these values. But I don't think they will make much difference to the
> >> overall problem. I think some of the other errors you have been seeing
> >> prior to this one are closer to where the real problem lies - such as
> >> the JRNL_WMGR_ENQDISCONT error.
> >>
> >> Do you have a reproducer of any kind? Does this error occur predictably
> >> under some or other conditions?
> >>
> >> Thanks,
> >>
> >> Kim van der Riet
> >>
> >> On 11/7/18 12:51 PM, rammohan ganapavarapu wrote:
> >>> Kim,
> >>>
> >>> I see these two settings in the code; can they be made configurable?
> >>>
> >>> #define MAX_AIO_SLEEPS     100000  // tot: ~1 sec
> >>> #define AIO_SLEEP_TIME_US  10      // 0.01 ms
> >>>
> >>> Ram
> >>>
> >>> On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <
> >>> [email protected]> wrote:
> >>>
> >>>> Thank you Kim, i will try your suggestions.
> >>>>
> >>>> On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet <[email protected]>
> >>>> wrote:
> >>>>> This error is a linearstore issue. It looks as though there is a
> >>>>> single write operation to disk that has become stuck, and is holding
> >>>>> up all further write operations. This happens because there is a
> >>>>> fixed circular pool of memory pages used for the AIO operations to
> >>>>> disk, and when one of these is "busy" (indicated by the letter A in
> >>>>> the page state map), write operations cannot continue until it is
> >>>>> cleared. If it does not clear within a certain time, then an
> >>>>> exception is thrown, which usually results in the broker closing
> >>>>> the connection.
> >>>>>
> >>>>> The events leading up to a "stuck" write operation are complex and
> >>>>> sometimes difficult to reproduce. If you have a reproducer, then I
> >>>>> would be interested to see it! Even so, the ability to reproduce on
> >>>>> another machine is hard, as it depends on such things as disk write
> >>>>> speed, the disk controller characteristics, the number of threads
> >>>>> in the thread pool (ie CPU type), memory and other hardware-related
> >>>>> things.
> >>>>>
> >>>>> There are two linearstore parameters that you can try playing with
> >>>>> to see if you can change the behavior of the store:
> >>>>>
> >>>>> wcache-page-size: This sets the size of each page in the write
> >>>>> buffer. A larger page size is good for large messages; a smaller
> >>>>> size will help if you have small messages.
> >>>>>
> >>>>> wcache-num-pages: The total number of pages in the write buffer.
> >>>>>
> >>>>> Use --help on the broker with the linearstore module loaded to see
> >>>>> more details on these. I hope that helps a little.
> >>>>>
> >>>>> Kim van der Riet
> >>>>>
> >>>>> On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:
> >>>>>> Any help in understanding why/when the broker throws those errors
> >>>>>> and stops receiving messages would be appreciated.
> >>>>>>
> >>>>>> Not sure if any kernel tuning or broker tuning needs to be done to
> >>>>>> solve this issue.
> >>>>>>
> >>>>>> Thanks in advance,
> >>>>>> Ram
> >>>>>>
> >>>>>> On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>> Also, from this log message (store level) it seems to be waiting
> >>>>>>> for AIO to complete:
> >>>>>>>
> >>>>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF ps=[-------------------------A------]
> >>>>>>>
> >>>>>>> page_state ps=[-------------------------A------] where A is AIO_PENDING
> >>>>>>> aer=1 _aio_evt_rem; ///< Remaining AIO events
> >>>>>>>
> >>>>>>> When there are pending AIOs, does the broker close the connection?
> >>>>>>> Is there any tuning that can be done to resolve this?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Ram
> >>>>>>>
> >>>>>>> On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> I was checking the code and I see these lines for that AIO
> >>>>>>>> timeout:
> >>>>>>>>
> >>>>>>>> case qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT:
> >>>>>>>>     if (++aio_sleep_cnt > MAX_AIO_SLEEPS)
> >>>>>>>>         THROW_STORE_EXCEPTION("Timeout waiting for AIO in MessageStoreImpl::recoverMessages()");
> >>>>>>>>     ::usleep(AIO_SLEEP_TIME_US);
> >>>>>>>>     break;
> >>>>>>>>
> >>>>>>>> And these are the defaults:
> >>>>>>>>
> >>>>>>>> #define MAX_AIO_SLEEPS     100000  // tot: ~1 sec
> >>>>>>>> #define AIO_SLEEP_TIME_US  10      // 0.01 ms
> >>>>>>>>
> >>>>>>>> RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next page
> >>>>>>>> is waiting for AIO.
> >>>>>>>>
> >>>>>>>> So is the page blocked, waiting for page availability?
> >>>>>>>>
> >>>>>>>> Ram
> >>>>>>>>
> >>>>>>>> On Mon, Nov 5, 2018 at 8:00 PM rammohan ganapavarapu <
> >>>>>>>> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Actually, we have upgraded from qpid-cpp 0.28 to 1.35 and after
> >>>>>>>>> that we see this message:
> >>>>>>>>>
> >>>>>>>>> 2018-10-27 18:58:25 [Store] warning Linear Store: Journal "<journal-name>": Bad record alignment found at fid=0x4605b offs=0x107680 (likely journal overwrite boundary); 19 filler record(s) required.
> >>>>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal "<journal-name>": Recover phase write: Wrote filler record: fid=0x4605b offs=0x107680
> >>>>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal "<journal-name>": Recover phase write: Wr... [a few more Recover phase logs]
> >>>>>>>>>
> >>>>>>>>> It worked fine for a day and then started throwing this message:
> >>>>>>>>>
> >>>>>>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF ps=[-------------------------A------]
> >>>>>>>>> 2018-10-28 12:27:01 [Broker] warning Exchange <name> cannot deliver to queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
> >>>>>>>>> 2018-10-28 12:27:01 [Broker] error Connection exception: framing-error: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
> >>>>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection qpid.server-ip:5672-client-ip:44457 closed by error: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)(501)
> >>>>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection qpid.server-ip:5672-client-ip:44457 closed by error: illegal-argument: Value for replyText is too large(320)
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Ram
> >>>>>>>>>
> >>>>>>>>> On Mon, Nov 5, 2018 at 3:34 PM rammohan ganapavarapu <
> >>>>>>>>> [email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> No, local disk.
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Nov 5, 2018 at 3:26 PM Gordon Sim <[email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>> On 05/11/18 22:58, rammohan ganapavarapu wrote:
> >>>>>>>>>>>> Gordon,
> >>>>>>>>>>>>
> >>>>>>>>>>>> We are using java client version 0.28 and qpid-cpp version
> >>>>>>>>>>>> 1.35 (qpid-cpp-server-1.35.0-1.el7.x86_64). I don't know in
> >>>>>>>>>>>> what scenario it happens, but after i restart the broker and
> >>>>>>>>>>>> we wait a few days, it happens again. From the above logs do
> >>>>>>>>>>>> you have any pointers to check?
> >>>>>>>>>>>
> >>>>>>>>>>> Are you using NFS?
> >>>>>>>>>>>
> >>>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>>>>>> For additional commands, e-mail: [email protected]
