Re: qpid-cpp-0.35 errors

2018-11-08 Thread rammohan ganapavarapu
Kim/Gordon,

I was wrong about NFS for the qpid journal files: it looks like they are on
NFS after all. Could NFS be causing this issue?

Ram

On Wed, Nov 7, 2018 at 12:18 PM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:

> Kim,
>
> Ok, I am still trying to see what part of my Java application is causing
> that issue; yes, the issue is happening intermittently. Regarding the
> "JERR_WMGR_ENQDISCONT" error, maybe they are chained exceptions from the
> previous error, JERR_JCNTL_AIOCMPLWAIT?
>
> Does message size contribute to this issue?
>
> Thanks,
> Ram
>
> On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet wrote:
>
>> No, they are not.
>>
>> These two defines govern the number of sleeps and the sleep time while
>> waiting before throwing an exception during recovery only. They do
>> not play a role during normal operation.
>>
>> If you are able to compile the broker code, you can try playing with
>> these values. But I don't think they will make much difference to the
>> overall problem. I think some of the other errors you have been seeing
>> prior to this one are closer to where the real problem lies - such as
>> the JERR_WMGR_ENQDISCONT error.
>>
>> Do you have a reproducer of any kind? Does this error occur predictably
>> under some or other conditions?
>>
>> Thanks,
>>
>> Kim van der Riet
>>
>> On 11/7/18 12:51 PM, rammohan ganapavarapu wrote:
>> > Kim,
>> >
>> > I see these two settings from code, can these be configurable?
>> >
>> > #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
>> >
>> > #define AIO_SLEEP_TIME_US 10 // 0.01 ms
>> >
>> >
>> > Ram
>> >
>> > On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <
>> > rammohanga...@gmail.com> wrote:
>> >
>> >> Thank you Kim, I will try your suggestions.
>> >>
>> >> On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet wrote:
>> >>
>> >>> This error is a linearstore issue. It looks as though there is a
>> >>> single write operation to disk that has become stuck, and is holding
>> >>> up all further write operations. This happens because there is a
>> >>> fixed circular pool of memory pages used for the AIO operations to
>> >>> disk, and when one of these is "busy" (indicated by the letter A in
>> >>> the page state map), write operations cannot continue until it is
>> >>> cleared. If it does not clear within a certain time, then an
>> >>> exception is thrown, which usually results in the broker closing the
>> >>> connection.
>> >>>
>> >>> The events leading up to a "stuck" write operation are complex and
>> >>> sometimes difficult to reproduce. If you have a reproducer, then I
>> >>> would be interested to see it! Even so, the ability to reproduce on
>> >>> another machine is hard, as it depends on such things as disk write
>> >>> speed, the disk controller characteristics, the number of threads in
>> >>> the thread pool (ie CPU type), memory and other hardware-related
>> >>> things.
>> >>>
>> >>> There are two linearstore parameters that you can try playing with to
>> >>> see if you can change the behavior of the store:
>> >>>
>> >>> wcache-page-size: This sets the size of each page in the write
>> >>> buffer. A larger page size is good for large messages; a smaller
>> >>> size will help if you have small messages.
>> >>>
>> >>> wcache-num-pages: The total number of pages in the write buffer.
>> >>>
>> >>> Use the --help on the broker with the linearstore loaded to see more
>> >>> details on this. I hope that helps a little.
>> >>>
>> >>> Kim van der Riet
>> >>>
>> >>> On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:
>>  Any help in understanding why/when the broker throws those errors and
>>  stops receiving messages would be appreciated.
>> 
>>  Not sure if any kernel tuning or broker tuning needs to be done to
>>  solve this issue.
>> 
>>  Thanks in advance,
>>  Ram
>> 
>>  On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <
>>  rammohanga...@gmail.com> wrote:
>> 
>> > Also from this log message (store level) it seems to be waiting for
>> > AIO to complete.
>> >
>> > 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal
>> > name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT;
>> > wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF
>> > ps=[-A--]
>> >
>> > page_state ps=[-A--] where A is AIO_PENDING
>> > aer=1 _aio_evt_rem;  ///< Remaining AIO events
>> >
>> > When there are pending AIO operations, does the broker close the
>> > connection? Is there any tuning that can be done to resolve this?
>> >
>> > Thanks,
>> > Ram
>> >
>> >
>> >
>> >
>> > On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu <
>> > rammohanga...@gmail.com> wrote:
>> >
>> >> I was checking the code and I see these lines for that AIO timeout.
>> >>
>> >> case qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT:
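
To make the control flow concrete, here is a minimal compilable sketch of
the kind of retry loop that sits behind that case, using the two defines
quoted earlier in the thread. WriteManager and wait_for_aio_page are
hypothetical stand-ins for the store's internals, not the actual broker
source:

#include <unistd.h>   // usleep()

#define MAX_AIO_SLEEPS 100000   // tot: ~1 sec
#define AIO_SLEEP_TIME_US 10    // 0.01 ms

// Hypothetical stand-in for the store's write manager.
struct WriteManager {
    int remaining = 3;   // pretend three poll cycles are needed
    void get_events() { if (remaining > 0) --remaining; }    // poll AIO completions
    bool page_aio_pending() const { return remaining > 0; }  // page still 'A' (AIO_PENDING)?
};

// Spin-wait for a busy AIO page to clear; give up after ~1 second in total.
bool wait_for_aio_page(WriteManager& wmgr) {
    for (int i = 0; i < MAX_AIO_SLEEPS; ++i) {
        wmgr.get_events();
        if (!wmgr.page_aio_pending())
            return true;               // page cleared, writes can continue
        usleep(AIO_SLEEP_TIME_US);     // back off 10 microseconds and retry
    }
    return false;   // still pending: the caller raises JERR_JCNTL_AIOCMPLWAIT,
                    // which typically ends with the broker closing the connection
}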

Re: [VOTE] Release Apache Qpid Proton-J 0.30.0

2018-11-08 Thread Keith W
+1.

My testing was:

* Verified signatures and checksums
* Checked for LICENCE and NOTICE files in the archives.
* Built from source / ran tests on Mac OS X 10.13.6
* Ran Qpid Broker-J client integration tests using the staged proton
artefacts and Qpid-JMS (master - 9c1afa9b3)
* Ran Qpid JMS test suite (master - 9c1afa9b3) using the staged proton artefacts
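
For reference, the signature/checksum step can be run along these lines;
the archive and KEYS file names below are illustrative examples, not taken
from this thread:

# Fetch the source archive plus its detached signature and checksum.
BASE=https://dist.apache.org/repos/dist/dev/qpid/proton-j/0.30.0-rc1
wget $BASE/apache-qpid-proton-j-0.30.0-src.tar.gz
wget $BASE/apache-qpid-proton-j-0.30.0-src.tar.gz.asc
wget $BASE/apache-qpid-proton-j-0.30.0-src.tar.gz.sha512

# Import the project signing keys, then verify both artefacts.
wget https://dist.apache.org/repos/dist/release/qpid/KEYS
gpg --import KEYS
gpg --verify apache-qpid-proton-j-0.30.0-src.tar.gz.asc
sha512sum -c apache-qpid-proton-j-0.30.0-src.tar.gz.sha512
# (If the .sha512 file is not in coreutils format, compare its digest
# against the output of 'sha512sum <archive>' by eye instead.)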
On Wed, 7 Nov 2018 at 16:33, Oleksandr Rudyy wrote:
>
> +1
>
> Robbie, thanks for the detailed explanation about the nature of the
> issue with qpid-jms-client 0.36.0.
> I was able to successfully run the Broker-J integration tests with
> qpid-jms-client 0.37.0 and the proton-j 0.30.0 RC.
> Apart from running the Qpid Broker-J integration tests, I verified
> signatures/checksums and built and ran the tests successfully from the
> proton-j 0.30.0 sources.
> On Wed, 7 Nov 2018 at 15:27, Robbie Gemmell wrote:
> >
> > After some investigation I found this actually stems from a bug in the
> > older 0.36.0 qpid-jms client rather than in the proton-j 0.30.0 RC.
> > The client had a bug in its implementation of a buffer interface, one
> > which proton-j is now making use of in a way that exposes that bug in
> > the older client. The particular bug was already fixed in 0.37.0, so
> > using 0.37.0 or the 0.38.0-SNAPSHOT passes those tests, as did a
> > modified 0.36.0.
> >
> > While it might have been nice to avoid this until a future point,
> > given it was a client bug and the situation doesn't arise with the
> > current release, I believe we should proceed as-is.
> >
> > Robbie
> >
> > On Wed, 7 Nov 2018 at 10:12, Oleksandr Rudyy wrote:
> > >
> > > Hi,
> > >
> > > I tried to run Qpid Broker-J system tests with qpid-jms-client 0.36.0
> > > and staged org.apache.qpid:proton-j:0.30.0 but JMS 1.1 tests from
> > > suite org.apache.qpid.systests.jms_1_1.message.LargeMessageTest failed
> > > due to mismatch of received message text and sent message text.
> > > The text messages with sizes 245KB, 512KB and 1MB are sent and
> > > received as part of the test suite.
> > > Somehow some characters at the end of the received messages have been
> > > replaced with '\u0' characters. The tests are passing with proton-j
> > > 0.29.0 and qpid-jms-client 0.36.0.
> > >
> > > Kind Regards,
> > > Alex
> > > On Tue, 6 Nov 2018 at 18:15, Robbie Gemmell wrote:
> > > >
> > > > Hi folks,
> > > >
> > > > I have put together a spin for a Qpid Proton-J 0.30.0 release, please
> > > > test it and vote accordingly.
> > > >
> > > > The files can be grabbed from:
> > > > https://dist.apache.org/repos/dist/dev/qpid/proton-j/0.30.0-rc1/
> > > >
> > > > The maven artifacts are staged for now at:
> > > > https://repository.apache.org/content/repositories/orgapacheqpid-1160
> > > >
> > > > The JIRAs assigned are:
> > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313720&version=12343875
> > > >
> > > > Regards,
> > > > Robbie
> > > >
> > > > P.S. If you want to test things out using maven with your own build
> > > > you can temporarily add this to your poms to access the staging repo:
> > > >
> > > >   <repositories>
> > > >     <repository>
> > > >       <id>staging</id>
> > > >       <url>https://repository.apache.org/content/repositories/orgapacheqpid-1160</url>
> > > >     </repository>
> > > >   </repositories>
> > > >
> > > > The dependency for proton-j would then be:
> > > >
> > > >   <dependency>
> > > >     <groupId>org.apache.qpid</groupId>
> > > >     <artifactId>proton-j</artifactId>
> > > >     <version>0.30.0</version>
> > > >   </dependency>
> > > >
> > > > -
> > > > To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
> > > > For additional commands, e-mail: users-h...@qpid.apache.org
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
> > > For additional commands, e-mail: users-h...@qpid.apache.org
> > >
> >
> > -
> > To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
> > For additional commands, e-mail: users-h...@qpid.apache.org
> >
>
> -
> To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
> For additional commands, e-mail: users-h...@qpid.apache.org
>

-
To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
For additional commands, e-mail: users-h...@qpid.apache.org



Re: qpid-cpp-0.35 errors

2018-11-08 Thread Kim van der Riet

On 11/7/18 3:18 PM, rammohan ganapavarapu wrote:

> Kim,
>
> Ok, I am still trying to see what part of my Java application is causing
> that issue; yes, the issue is happening intermittently. Regarding the
> "JERR_WMGR_ENQDISCONT" error, maybe they are chained exceptions from the
> previous error, JERR_JCNTL_AIOCMPLWAIT?
In my mind, it is more likely the other way around. But the logs should 
tell you that. It would be best to start with a clean store before each 
test so you don't inherit issues from a previous test or run.
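
For example, resetting the store can be as simple as the following sketch;
stop the broker first, and note that /var/lib/qpidd is only a common
default data directory, so substitute your configured --data-dir:

# Stop the broker, remove the linearstore journal directory, restart.
systemctl stop qpidd
rm -rf /var/lib/qpidd/qls   # 'qls' is the linearstore directory under the data dir
systemctl start qpidd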


> Does message size contribute to this issue?


Yes, but only in the sense that the size alters the packing of the write
buffers and the timing of when they are written. The number of
simultaneous producers and consumers will also affect this. In particular,
two producers simultaneously sending messages to the same queue, or a
consumer consuming from a queue while a producer is still sending, are
going to be the main factors in any race condition such as I suspect this
is. Playing with those will give clues as to what is happening. You could
try the following, each time starting with a clean store (a command-line
sketch follows the list):


1. Only allowing a single producer, followed by a single consumer (ie 
not at the same time);


2. Allowing a single producer and a single consumer to operate 
simultaneously;


3. Allowing multiple producers only (I don't know if your use-case has this);

4. Allowing multiple consumers.

Once you have isolated which scenarios cause the problem, then try
varying the message size. The answers to these will help isolate where
the issue is happening.
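
As an illustrative sketch of scenarios 1 and 2 using the qpid-cpp
command-line clients (the queue name, message count and size below are
placeholders):

# Scenario 1: one producer runs to completion, then one consumer drains.
qpid-send -a 'testq; {create: always, node: {durable: true}}' \
    --messages 10000 --content-size 1024 --durable yes
qpid-receive -a testq --messages 10000

# Scenario 2: run the same sender and receiver concurrently instead,
# e.g. start qpid-send in the background ('&') and qpid-receive right after.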




Thanks,
Ram

On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet wrote:


No, they are not.

These two defines govern the number of sleeps and the sleep time while
waiting before throwing an exception during recovery only. They do
not play a role during normal operation.

If you are able to compile the broker code, you can try playing with
these values. But I don't think they will make much difference to the
overall problem. I think some of the other errors you have been seeing
prior to this one are closer to where the real problem lies - such as
the JERR_WMGR_ENQDISCONT error.

Do you have a reproducer of any kind? Does this error occur predictably
under some or other conditions?

Thanks,

Kim van der Riet

On 11/7/18 12:51 PM, rammohan ganapavarapu wrote:

Kim,

I see these two settings from code, can these be configurable?

#define MAX_AIO_SLEEPS 100000 // tot: ~1 sec

#define AIO_SLEEP_TIME_US 10 // 0.01 ms


Ram

On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:


Thank you Kim, I will try your suggestions.

On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet wrote:

This error is a linearstore issue. It looks as though there is a single
write operation to disk that has become stuck, and is holding up all
further write operations. This happens because there is a fixed circular
pool of memory pages used for the AIO operations to disk, and when one
of these is "busy" (indicated by the letter A in the page state map),
write operations cannot continue until it is cleared. If it does not
clear within a certain time, then an exception is thrown, which usually
results in the broker closing the connection.

The events leading up to a "stuck" write operation are complex and
sometimes difficult to reproduce. If you have a reproducer, then I would
be interested to see it! Even so, the ability to reproduce on another
machine is hard, as it depends on such things as disk write speed, the
disk controller characteristics, the number of threads in the thread
pool (ie CPU type), memory and other hardware-related things.

There are two linearstore parameters that you can try playing with to
see if you can change the behavior of the store:

wcache-page-size: This sets the size of each page in the write buffer.
A larger page size is good for large messages; a smaller size will help
if you have small messages.

wcache-num-pages: The total number of pages in the write buffer.

Use the --help on the broker with the linearstore loaded to see more
details on this. I hope that helps a little.
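
As an illustration, both options can be given on the broker command line;
the values below are arbitrary examples, so confirm the exact option names
and defaults with qpidd --help on your build:

# Example only: 64 KiB pages x 32 pages = 2 MiB of write cache.
qpidd --load-module linearstore.so \
      --wcache-page-size 64 \
      --wcache-num-pages 32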

Kim van der Riet

On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:

Any help in understanding why/when the broker throws those errors and
stops receiving messages would be appreciated.

Not sure if any kernel tuning or broker tuning needs to be done to
solve this issue.

Thanks in advance,
Ram

On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:


Also from this log message (store level) it seems to be waiting for AIO
to complete.

2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal
name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT;
wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF
ps=[-A--]

page_state ps=[-A--] where A is AIO_PENDING
aer=1 _aio_evt_rem;  ///< Remaining AIO events

When there are pending

Re: qpid-cpp-0.35 errors

2018-11-08 Thread rammohan ganapavarapu
Do you have any kernel (net/disk) tuning recommendations for qpid-cpp with
linearstore?

Ram

On Thu, Nov 8, 2018 at 8:56 AM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:

> Kim/Gordon,
>
> I was wrong about NFS for the qpid journal files: it looks like they are
> on NFS after all. Could NFS be causing this issue?
>
> Ram
>
> On Wed, Nov 7, 2018 at 12:18 PM rammohan ganapavarapu <
> rammohanga...@gmail.com> wrote:
>
>> Kim,
>>
>> Ok, I am still trying to see what part of my Java application is causing
>> that issue; yes, the issue is happening intermittently. Regarding the
>> "JERR_WMGR_ENQDISCONT" error, maybe they are chained exceptions from the
>> previous error, JERR_JCNTL_AIOCMPLWAIT?
>>
>> Does message size contribute to this issue?
>>
>> Thanks,
>> Ram
>>
>> On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet wrote:
>>
>>> No, they are not.
>>>
>>> These two defines govern the number of sleeps and the sleep time while
>>> waiting before throwing an exception during recovery only. They do
>>> not play a role during normal operation.
>>>
>>> If you are able to compile the broker code, you can try playing with
>>> these values. But I don't think they will make much difference to the
>>> overall problem. I think some of the other errors you have been seeing
>>> prior to this one are closer to where the real problem lies - such as
>>> the JERR_WMGR_ENQDISCONT error.
>>>
>>> Do you have a reproducer of any kind? Does this error occur predictably
>>> under some or other conditions?
>>>
>>> Thanks,
>>>
>>> Kim van der Riet
>>>
>>> On 11/7/18 12:51 PM, rammohan ganapavarapu wrote:
>>> > Kim,
>>> >
>>> > I see these two settings from code, can these be configurable?
>>> >
>>> > #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
>>> >
>>> > #define AIO_SLEEP_TIME_US 10 // 0.01 ms
>>> >
>>> >
>>> > Ram
>>> >
>>> > On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <
>>> > rammohanga...@gmail.com> wrote:
>>> >
>>> >> Thank you Kim, I will try your suggestions.
>>> >>
>>> >> On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet wrote:
>>> >>
>>> >>> This error is a linearstore issue. It looks as though there is a
>>> >>> single write operation to disk that has become stuck, and is holding
>>> >>> up all further write operations. This happens because there is a
>>> >>> fixed circular pool of memory pages used for the AIO operations to
>>> >>> disk, and when one of these is "busy" (indicated by the letter A in
>>> >>> the page state map), write operations cannot continue until it is
>>> >>> cleared. If it does not clear within a certain time, then an
>>> >>> exception is thrown, which usually results in the broker closing the
>>> >>> connection.
>>> >>>
>>> >>> The events leading up to a "stuck" write operation are complex and
>>> >>> sometimes difficult to reproduce. If you have a reproducer, then I
>>> >>> would be interested to see it! Even so, the ability to reproduce on
>>> >>> another machine is hard, as it depends on such things as disk write
>>> >>> speed, the disk controller characteristics, the number of threads in
>>> >>> the thread pool (ie CPU type), memory and other hardware-related
>>> >>> things.
>>> >>>
>>> >>> There are two linearstore parameters that you can try playing with to
>>> >>> see if you can change the behavior of the store:
>>> >>>
>>> >>> wcache-page-size: This sets the size of each page in the write
>>> >>> buffer. A larger page size is good for large messages; a smaller
>>> >>> size will help if you have small messages.
>>> >>>
>>> >>> wcache-num-pages: The total number of pages in the write buffer.
>>> >>>
>>> >>> Use the --help on the broker with the linearstore loaded to see more
>>> >>> details on this. I hope that helps a little.
>>> >>>
>>> >>> Kim van der Riet
>>> >>>
>>> >>> On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:
>>>  Any help in understanding why/when the broker throws those errors and
>>>  stops receiving messages would be appreciated.
>>> 
>>>  Not sure if any kernel tuning or broker tuning needs to be done to
>>>  solve this issue.
>>> 
>>>  Thanks in advance,
>>>  Ram
>>> 
>>>  On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <
>>>  rammohanga...@gmail.com> wrote:
>>> 
>>> > Also from this log message (store level) it seems to be waiting for
>>> > AIO to complete.
>>> >
>>> > 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal
>>> > name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT;
>>> > wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF
>>> > ps=[-A--]
>>> >
>>> > page_state ps=[-A--] where A is AIO_PENDING
>>> > aer=1 _aio_evt_rem;  ///< Remaining AIO events
>>> >
>>> > When there are pending AIO operations, does the broker close the
>>> > connection? Is there any tuning that can be done to resolve this?
>>> >
>>> > Thanks,
>>> > Ram
>>> >
>>> 

Re: qpid-cpp-0.35 errors

2018-11-08 Thread Kim van der Riet

Resending, did not show up on the list the first time I sent it...



 Forwarded Message 
Subject:Re: qpid-cpp-0.35 errors
Date:   Thu, 8 Nov 2018 09:30:24 -0500
From:   Kim van der Riet 
To: users@qpid.apache.org



On 11/7/18 3:18 PM, rammohan ganapavarapu wrote:

> Kim,
>
> Ok, I am still trying to see what part of my Java application is causing
> that issue; yes, the issue is happening intermittently. Regarding the
> "JERR_WMGR_ENQDISCONT" error, maybe they are chained exceptions from the
> previous error, JERR_JCNTL_AIOCMPLWAIT?
In my mind, it is more likely the other way around. But the logs should 
tell you that. It would be best to start with a clean store before each 
test so you don't inherit issues from a previous test or run.


> Does message size contribute to this issue?


Yes, but only in the sense that the size alters the packing of the write
buffers and the timing of when they are written. The number of
simultaneous producers and consumers will also affect this. In particular,
two producers simultaneously sending messages to the same queue, or a
consumer consuming from a queue while a producer is still sending, are
going to be the main factors in any race condition such as I suspect this
is. Playing with those will give clues as to what is happening. You could
try the following, each time starting with a clean store:


1. Only allowing a single producer, followed by a single consumer (ie 
not at the same time);


2. Allowing a single producer and a single consumer to operate 
simultaneously;


3. Allowing multiple producers only (I don't know if your use-case has this);

4. Allowing multiple consumers.

Once you have isolated which scenarios cause the problem, then try
varying the message size. The answers to these will help isolate where
the issue is happening.




Thanks,
Ram

On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet wrote:


No, they are not.

These two defines govern the number of sleeps and the sleep time while
waiting before throwing an exception during recovery only. They do
not play a role during normal operation.

If you are able to compile the broker code, you can try playing with
these values. But I don't think they will make much difference to the
overall problem. I think some of the other errors you have been seeing
prior to this one are closer to where the real problem lies - such as
the JERR_WMGR_ENQDISCONT error.

Do you have a reproducer of any kind? Does this error occur predictably
under some or other conditions?

Thanks,

Kim van der Riet

On 11/7/18 12:51 PM, rammohan ganapavarapu wrote:

Kim,

I see these two settings from code, can these be configurable?

#define MAX_AIO_SLEEPS 100000 // tot: ~1 sec

#define AIO_SLEEP_TIME_US 10 // 0.01 ms


Ram

On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:


Thank you Kim, I will try your suggestions.

On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet wrote:
This error is a linearstore issue. It looks as though there is a single
write operation to disk that has become stuck, and is holding up all
further write operations. This happens because there is a fixed circular
pool of memory pages used for the AIO operations to disk, and when one
of these is "busy" (indicated by the letter A in the page state map),
write operations cannot continue until it is cleared. If it does not
clear within a certain time, then an exception is thrown, which usually
results in the broker closing the connection.

The events leading up to a "stuck" write operation are complex and
sometimes difficult to reproduce. If you have a reproducer, then I would
be interested to see it! Even so, the ability to reproduce on another
machine is hard, as it depends on such things as disk write speed, the
disk controller characteristics, the number of threads in the thread
pool (ie CPU type), memory and other hardware-related things.

There are two linearstore parameters that you can try playing with to
see if you can change the behavior of the store:

wcache-page-size: This sets the size of each page in the write buffer.
A larger page size is good for large messages; a smaller size will help
if you have small messages.

wcache-num-pages: The total number of pages in the write buffer.

Use the --help on the broker with the linearstore loaded to see more
details on this. I hope that helps a little.

Kim van der Riet

On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:

Any help in understanding why/when the broker throws those errors and
stops receiving messages would be appreciated.

Not sure if any kernel tuning or broker tuning needs to be done to
solve this issue.

Thanks in advance,
Ram

On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:


Also from this log message (store level) it seems to be waiting for AIO
to complete.

2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal
name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT;
wmgr_status: