Re: qpid-cpp-0.35 errors

Kim van der Riet Thu, 08 Nov 2018 11:26:51 -0800

Resending, did not show up on the list the first time I sent it...




-------- Forwarded Message --------
Subject:        Re: qpid-cpp-0.35 errors
Date:   Thu, 8 Nov 2018 09:30:24 -0500
From:   Kim van der Riet <[email protected]>
To:     [email protected]



On 11/7/18 3:18 PM, rammohan ganapavarapu wrote:

Kim,

Ok, i am still trying to see what part of my java application is causing
that issue, yes that issue is happening intermittently. Regarding
"JERR_WMGR_ENQDISCONT" error, may be they are chained exceptions from the
previous error JERR_JCNTL_AIOCMPLWAIT?

In my mind, it is more likely the other way around. But the logs shouldtell you that. It would be best to start with a clean store before eachtest so you don't inherit issues from a previous test or run.


Does message size contribute to this issue?

Yes, but only in the sense that the size alters the packing of the writebuffers, and the timing of when they are written. Also, the number ofsimultaneous producers and consumers will affect this. In particular,when two consumers are simultaneously sending messages to the samequeue, also if a consumer is consuming from a queue while a producer isalso sending are going to be the main factors in any race condition suchas I suspect this is. Playing with those will give clues as to what ishappening. You could try the following, each time starting with a cleanstore:

1. Only allowing a single producer, followed by a single consumer (ienot at the same time);

2. Allowing a single producer and a single consumer to operatesimultaneously;


3. Allowing multiple producers (I don't know if your use-case has this) only

4. Allowing multiple consumers.

Once you have isolated which scenarios cause the problem, then tryvarying the message size. The answers to these will help isolating wherethe issue is happening.


Thanks,
Ram

On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet <[email protected]>
wrote:

No, they are not.

These two defines govern the number of sleeps and the sleep time while
waiting for before throwing an exception during recovery only. They do
not play a role during normal operation.

If you are able to compile the broker code, you can try playing with
these values. But I don't think they will make much difference to the
overall problem. I think some of the other errors you have been seeing
prior to this one are closer to where the real problem lies - such as
the JRNL_WMGR_ENQDISCONT error.

Do you have a reproducer of any kind? Does this error occur predictably
under some or other conditions?

Thanks,

Kim van der Riet

On 11/7/18 12:51 PM, rammohan ganapavarapu wrote:

Kim,

I see these two settings from code, can these be configurable?

#define MAX_AIO_SLEEPS 100000 // tot: ~1 sec

#define AIO_SLEEP_TIME_US 10 // 0.01 ms


Ram

On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <
[email protected]> wrote:

Thank you Kim, i will try your suggestions.

On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet <[email protected]

wrote:

This error is a linearstore issue. It looks as though there is asingle
write operation to disk that has become stuck, and is holding up all
further write operations. This happens because there is a fixed

circular

pool of memory pages used for the AIO operations to disk, and when one
of these is "busy" (indicated by the A letter in the page state map),
write operations cannot continue until it is cleared. It it does not

clear within a certain time, then an exception is thrown, whichusually

results in the broker closing the connection.

The events leading up to a "stuck" write operation are complex and
sometimes difficult to reproduce. If you have a reproducer, then I

would

be interested to see it! Even so, the ability to reproduce on another
machine is hard as it depends on such things as disk write speed, the
disk controller characteristics, the number of threads in the thread
pool (ie CPU type), memory and other hardware-related things.

There are two linearstore parameters that you can try playing with to
see if you can change the behavior of the store:

wcache-page-size: This sets the size of each page in the write buffer.
Larger page size is good for large messages, a smaller size will help

if

you have small messages.

wchache-num-pages: The total number of pages in the write buffer.

Use the --help on the broker with the linearstore loaded to see more
details on this. I hope that helps a little.

Kim van der Riet

On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:

Any help in understand why/when broker throws those errors and stop
receiving message would be appreciated.

Not sure if any kernel tuning or broker tuning needs to be done to
solve this issue.

Thanks in advance,
Ram

On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <
[email protected]> wrote:

Also from this log message (store level) it seems like waiting for

AIO

to

complete.

2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal
name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT;
wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF
ps=[-------------------------A------]

page_state ps=[-------------------------A------] where A is

AIO_PENDING

aer=1 _aio_evt_rem; ///< Remaining AIO events

When there is or there are pending AIO, does broker close the

connection?

is there any tuning that can be done to resolve this?

Thanks,
Ram




On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu <
[email protected]> wrote:

I was check the code and i see these lines for that AIO timeout.

case

qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT:

if (++aio_sleep_cnt > MAX_AIO_SLEEPS)
THROW_STORE_EXCEPTION("Timeout waiting for

AIO in

MessageStoreImpl::recoverMessages()");
::usleep(AIO_SLEEP_TIME_US);
break;

And these are the defaults

#define MAX_AIO_SLEEPS 100000 // tot: ~1 sec

#define AIO_SLEEP_TIME_US 10 // 0.01 ms


RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next page

is

waiting for AIO.



So does page got blocked and its waiting for page availability?


Ram

On Mon, Nov 5, 2018 at 8:00 PM rammohan ganapavarapu <
[email protected]> wrote:

Actually we have upgraded from qpid-cpp 0.28 to 1.35 and afterthat

we

see this message

2018-10-27 18:58:25 [Store] warning Linear Store: Journal
"<journal-name>": Bad record alignment found at fid=0x4605b

offs=0x107680

(likely journal overwrite boundary); 19 filler record(s) required.
2018-10-27 18:58:25 [Store] notice Linear Store: Journal
"<journal-name>": Recover phase write: Wrote filler record:

fid=0x4605b

offs=0x107680
2018-10-27 18:58:25 [Store] notice Linear Store: Journal
"<journal-name>": Recover phase write: Wr... few more Recoverphase

logs

It worked fine for a day and started throwing this message:

2018-10-28 12:27:01 [Store] critical Linear Store: Journal

"<name>":

get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr:

pi=25 pc=8

po=0 aer=1 edac=TFFF ps=[-------------------------A------]
2018-10-28 12:27:01 [Broker] warning Exchange <name> cannotdeliver

to

queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store()

failed:

jexception 0x0202 jcntl::handle_aio_wait() threw

JERR_JCNTL_AIOCMPLWAIT:

Timeout waiting for AIOs to complete.

(/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)

2018-10-28 12:27:01 [Broker] error Connection exception:

framing-error:

Queue <queue-name>: MessageStoreImpl::store() failed: jexception

0x0202

jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout

waiting for

AIOs to complete.

(/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)

2018-10-28 12:27:01 [Protocol] error Connection
qpid.server-ip:5672-client-ip:44457 closed by error: Queue

<queue-name>:

MessageStoreImpl::store() failed: jexception 0x0202
jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout

waiting for

AIOs to complete.

(/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)(501)

2018-10-28 12:27:01 [Protocol] error Connection
qpid.server-ip:5672-client-ip:44457 closed by error:

illegal-argument:

Value for replyText is too large(320)

Thanks,
Ram


On Mon, Nov 5, 2018 at 3:34 PM rammohan ganapavarapu <
[email protected]> wrote:

No, local disk.

On Mon, Nov 5, 2018 at 3:26 PM Gordon Sim <[email protected]>

wrote:

On 05/11/18 22:58, rammohan ganapavarapu wrote:
Gordon,
We are using java client 0.28 version and qpidd-cpp 1.35version
(qpid-cpp-server-1.35.0-1.el7.x86_64), i dont know at what

scenario

its
happening but after i restart broker and if we wait for fewdays

its

happening again. From the above logs do you have anypointers to
check?

Are you using NFS?

---------------------------------------------------------------------

To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: qpid-cpp-0.35 errors

Reply via email to