This error is a linearstore issue. It looks as though there is a single
write operation to disk that has become stuck, and is holding up all
further write operations. This happens because there is a fixed circular
pool of memory pages used for the AIO operations to disk, and when one
of these is "busy" (indicated by the letter A in the page state map),
write operations cannot continue until it is cleared. If it does not
clear within a certain time, then an exception is thrown, which usually
results in the broker closing the connection.
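
To illustrate the blocking behaviour described above, here is a minimal sketch of a circular page pool. It is not the linearstore implementation: the names PagePool, tryAcquire, complete and stateMap are hypothetical, and the rule that the writer can only take the next page in ring order is an assumption drawn from the description above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical page states, loosely modelled on the page state map in the
// log ('-' for a free page, 'A' for a page whose AIO write is still pending).
enum class PageState { Unused, AioPending };

// Illustrative fixed circular pool of write pages (assumed behaviour only).
class PagePool {
public:
    explicit PagePool(std::size_t numPages)
        : states_(numPages, PageState::Unused), next_(0) {}

    // Try to claim the next page in ring order. Returns false if that page
    // is still waiting for its AIO write to complete; the writer then has
    // to wait, because it cannot skip ahead to a later free page.
    bool tryAcquire(std::size_t& pageIndex) {
        if (states_[next_] == PageState::AioPending) return false;
        pageIndex = next_;
        states_[next_] = PageState::AioPending;  // page handed to the disk
        next_ = (next_ + 1) % states_.size();
        return true;
    }

    // Called when the disk reports that the write for this page finished.
    void complete(std::size_t pageIndex) {
        states_[pageIndex] = PageState::Unused;
    }

    // Render the pool in the style of the ps=[...] map, without brackets.
    std::string stateMap() const {
        std::string map;
        for (PageState s : states_) {
            map += (s == PageState::AioPending) ? 'A' : '-';
        }
        return map;
    }

private:
    std::vector<PageState> states_;
    std::size_t next_;
};
```

In this sketch a single page that never completes eventually stops every later write once the ring wraps around to it, even if all other pages are free, which is the "stuck" situation described above.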
The events leading up to a "stuck" write operation are complex and
sometimes difficult to reproduce. If you have a reproducer, then I would
be interested to see it! Even so, reproducing it on another machine is
hard, as it depends on such things as disk write speed, the disk
controller characteristics, the number of threads in the thread pool
(i.e. CPU type), memory and other hardware-related things.
There are two linearstore parameters that you can try playing with to
see if you can change the behavior of the store:
wcache-page-size: This sets the size of each page in the write buffer.
A larger page size is good for large messages; a smaller size will help
if you have small messages.
wcache-num-pages: The total number of pages in the write buffer.
Use --help on the broker with the linearstore loaded to see more
details on this. I hope that helps a little.
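
As a rough aid for choosing values, the arithmetic behind the two parameters can be sketched as below. The helper names are hypothetical, the unit of the page size is assumed to be KiB (check the --help output for the actual unit and allowed values), and the per-message estimate ignores any record headers or padding the journal adds.

```cpp
#include <cstdint>

// Total write-buffer size in bytes for a page size (assumed to be in KiB)
// and a page count.
constexpr std::uint64_t writeCacheBytes(std::uint64_t pageSizeKiB,
                                        std::uint64_t numPages) {
    return pageSizeKiB * 1024 * numPages;
}

// Number of pages one message of the given size would span, using ceiling
// division. Journal record overhead is ignored in this estimate.
constexpr std::uint64_t pagesPerMessage(std::uint64_t messageBytes,
                                        std::uint64_t pageSizeKiB) {
    const std::uint64_t pageBytes = pageSizeKiB * 1024;
    return (messageBytes + pageBytes - 1) / pageBytes;
}
```

With example values of 16 KiB pages and 32 pages, writeCacheBytes(16, 32) gives 524288 bytes, and a 100000-byte message spans 7 such pages but only 1 page of 128 KiB. These numbers are illustrative, not recommended settings.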
Kim van der Riet
On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:
Any help in understanding why/when the broker throws those errors and
stops receiving messages would be appreciated.
I am not sure if any kernel tuning or broker tuning needs to be done to
solve this issue.
Thanks in advance,
Ram
On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <
[email protected]> wrote:
Also, from this log message (store level), it seems to be waiting for AIO
to complete.
2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal
name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT;
wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF
ps=[-------------------------A------]
page_state ps=[-------------------------A------] where A is AIO_PENDING
aer=1 _aio_evt_rem; ///< Remaining AIO events
When there are pending AIO operations, does the broker close the connection?
Is there any tuning that can be done to resolve this?
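
When scanning many such log lines, the page state map can be decoded mechanically. The following is a minimal sketch with a hypothetical helper name, pendingPages. It assumes only what the annotation above states: the map appears as ps=[...] and an A marks a page with AIO pending.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the zero-based indices of pages marked 'A' (AIO pending) in a
// wmgr_status line. Returns an empty vector if no "ps=[...]" field is found.
std::vector<std::size_t> pendingPages(const std::string& statusLine) {
    std::vector<std::size_t> pending;
    const std::string marker = "ps=[";
    const std::size_t start = statusLine.find(marker);
    if (start == std::string::npos) return pending;
    const std::size_t first = start + marker.size();
    const std::size_t end = statusLine.find(']', first);
    if (end == std::string::npos) return pending;
    for (std::size_t i = first; i < end; ++i) {
        if (statusLine[i] == 'A') pending.push_back(i - first);
    }
    return pending;
}
```

For a map made of 25 dashes, one A and 6 more dashes, this returns the single index 25.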
Thanks,
Ram
On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu <
[email protected]> wrote:
I was checking the code and I see these lines for that AIO timeout:
case qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT:
    if (++aio_sleep_cnt > MAX_AIO_SLEEPS)
        THROW_STORE_EXCEPTION("Timeout waiting for AIO in MessageStoreImpl::recoverMessages()");
    ::usleep(AIO_SLEEP_TIME_US);
    break;
And these are the defaults
#define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
#define AIO_SLEEP_TIME_US 10 // 0.01 ms
RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next page is waiting for AIO.
So did the page get blocked, and is it waiting for page availability?
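
The timeout pattern in the quoted lines can be sketched in isolation as below. This is a standalone illustration, not the broker's code: waitForPage, maxWaitUs and the pageReady callback are hypothetical names, and only the two constants are taken from the defines quoted above.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Values copied from the defines quoted above.
constexpr unsigned MAX_AIO_SLEEPS = 100000;
constexpr unsigned AIO_SLEEP_TIME_US = 10;

// Nominal worst-case time spent sleeping before the timeout, in microseconds.
constexpr unsigned long long maxWaitUs(unsigned long long maxSleeps,
                                       unsigned long long sleepUs) {
    return maxSleeps * sleepUs;
}

// 100000 sleeps of 10 us each gives the "~1 sec" noted in the define.
static_assert(maxWaitUs(MAX_AIO_SLEEPS, AIO_SLEEP_TIME_US) == 1000000,
              "nominal timeout should be one second");

// Poll pageReady until it returns true or the sleep budget is exhausted.
// Returns true if the page became available, false on timeout (the point
// at which the quoted code throws its exception instead).
bool waitForPage(const std::function<bool()>& pageReady,
                 unsigned maxSleeps, unsigned sleepUs) {
    unsigned sleeps = 0;
    while (!pageReady()) {
        if (++sleeps > maxSleeps) return false;
        std::this_thread::sleep_for(std::chrono::microseconds(sleepUs));
    }
    return true;
}
```

The one-second figure is nominal. Each short sleep usually takes longer than requested because of scheduler granularity, so the real elapsed time before a timeout is typically longer.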
Ram
On Mon, Nov 5, 2018 at 8:00 PM rammohan ganapavarapu <
[email protected]> wrote:
Actually, we upgraded from qpid-cpp 0.28 to 1.35, and after that we
see this message:
2018-10-27 18:58:25 [Store] warning Linear Store: Journal
"<journal-name>": Bad record alignment found at fid=0x4605b offs=0x107680
(likely journal overwrite boundary); 19 filler record(s) required.
2018-10-27 18:58:25 [Store] notice Linear Store: Journal
"<journal-name>": Recover phase write: Wrote filler record: fid=0x4605b
offs=0x107680
2018-10-27 18:58:25 [Store] notice Linear Store: Journal
"<journal-name>": Recover phase write: Wr... few more Recover phase logs
It worked fine for a day and started throwing this message:
2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<name>":
get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=25 pc=8
po=0 aer=1 edac=TFFF ps=[-------------------------A------]
2018-10-28 12:27:01 [Broker] warning Exchange <name> cannot deliver to
queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store() failed:
jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT:
Timeout waiting for AIOs to complete.
(/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
2018-10-28 12:27:01 [Broker] error Connection exception: framing-error:
Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202
jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for
AIOs to complete.
(/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
2018-10-28 12:27:01 [Protocol] error Connection
qpid.server-ip:5672-client-ip:44457 closed by error: Queue <queue-name>:
MessageStoreImpl::store() failed: jexception 0x0202
jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for
AIOs to complete.
(/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)(501)
2018-10-28 12:27:01 [Protocol] error Connection
qpid.server-ip:5672-client-ip:44457 closed by error: illegal-argument:
Value for replyText is too large(320)
Thanks,
Ram
On Mon, Nov 5, 2018 at 3:34 PM rammohan ganapavarapu <
[email protected]> wrote:
No, local disk.
On Mon, Nov 5, 2018 at 3:26 PM Gordon Sim <[email protected]> wrote:
On 05/11/18 22:58, rammohan ganapavarapu wrote:
Gordon,
We are using Java client version 0.28 and qpid-cpp version 1.35
(qpid-cpp-server-1.35.0-1.el7.x86_64). I don't know in what scenario it is
happening, but after I restart the broker and we wait a few days, it
happens again. From the above logs, do you have any pointers to check?
Are you using NFS?
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]