[jira] [Commented] (ARTEMIS-2113) When AMQP links are opened and closed in quick succession, Artemis doesn’t always respond with attach/detach frames confirming the opening/closing of the links

2018-10-21 Thread Simon Chalmers (JIRA)


[ 
https://issues.apache.org/jira/browse/ARTEMIS-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658436#comment-16658436
 ] 

Simon Chalmers commented on ARTEMIS-2113:
-

Would it help further discussion of this issue if we provided a sample test 
app that can be run in a Linux environment?

A simple 64-bit Linux environment, such as Debian, Ubuntu, Amazon Linux et al., 
along with the .NET 2.1 SDK, would be all that is required to run the sample 
test app and reproduce the error.

> When AMQP links are opened and closed in quick succession, Artemis doesn’t 
> always respond with attach/detach frames confirming the opening/closing of 
> the links
> ---
>
> Key: ARTEMIS-2113
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2113
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: AMQP
> Affects Versions: 2.6.3
> Environment: This test was done in .NET using the Amqp.Net Lite 
> library 2.1.4 (which is the latest version).
> Reporter: Simon Chalmers
> Priority: Major
>
> A trace of a message exchange with Artemis where this has occurred is below. 
> The lines highlighted in yellow show two AMQP links being opened and then 
> closed, but Artemis doesn’t respond with its own attach/detach frames, 
> which would acknowledge the opening/closing of those links. The lines 
> highlighted in green are heartbeat frames sent to Artemis every 15 seconds, 
> which show that one minute passed without receiving the attach/detach frames 
> from Artemis.
> SEND AMQP 3 1 0 0
> RECV sasl-mechanisms(sasl-server-mechanisms:[PLAIN,ANONYMOUS])
> SEND sasl-init(mechanism:PLAIN,initial-response:...,hostname:localhost)
> RECV sasl-outcome(code:0)
> SEND AMQP 0 1.0.0
> SEND (ch=0) 
> open(container-id:AMQPNetLite-43a6c9ad,host-name:localhost,max-frame-size:262144,channel-max:255,idle-time-out:2147483647)
> RECV AMQP 0 1 0 0
> SEND (ch=0) 
> begin(next-outgoing-id:4294967293,incoming-window:2048,outgoing-window:2048,handle-max:63)
> RECV (ch=0) 
> open(container-id:0.0.0.0,max-frame-size:131072,channel-max:65535,idle-time-out:3,offered-capabilities:[sole-connection-for-container,DELAYED_DELIVERY,SHARED-SUBS,ANONYMOUS-RELAY],properties:[product:apache-activemq-artemis,version:2.6.3])
> {color:#f6c342}SEND (ch=0) 
> attach(name:link1,handle:0,role:False,source:source(),target:target(address:q1),initial-delivery-count:0)
> SEND (ch=0) detach(handle:0,closed:True)
> SEND (ch=0) 
> attach(name:link0,handle:1,role:False,source:source(),target:target(address:q0),initial-delivery-count:0)
> SEND (ch=0) detach(handle:1,closed:True){color}
> RECV (ch=0) 
> begin(remote-channel:0,next-outgoing-id:1,incoming-window:2147483647,outgoing-window:2147483647,handle-max:65535)
> {color:#14892c}SEND (ch=0) empty
> SEND (ch=0) empty
> SEND (ch=0) empty
> SEND (ch=0) empty{color}
> Note that this doesn’t always happen. Sometimes Artemis does respond, as 
> shown by the lines highlighted in bold/grey in the trace below.
> SEND AMQP 3 1 0 0
> RECV sasl-mechanisms(sasl-server-mechanisms:[PLAIN,ANONYMOUS])
> SEND sasl-init(mechanism:PLAIN,initial-response:...,hostname:localhost)
> RECV sasl-outcome(code:0)
> SEND AMQP 0 1.0.0
> SEND (ch=0) 
> open(container-id:AMQPNetLite-b00e0be7,host-name:localhost,max-frame-size:262144,channel-max:255,idle-time-out:2147483647)
> RECV AMQP 0 1 0 0
> RECV (ch=0) 
> open(container-id:0.0.0.0,max-frame-size:131072,channel-max:65535,idle-time-out:3,offered-capabilities:[sole-connection-for-container,DELAYED_DELIVERY,SHARED-SUBS,ANONYMOUS-RELAY],properties:[product:apache-activemq-artemis,version:2.6.3])
> SEND (ch=0) 
> begin(next-outgoing-id:4294967293,incoming-window:2048,outgoing-window:2048,handle-max:63)
> RECV (ch=0) 
> begin(remote-channel:0,next-outgoing-id:1,incoming-window:2147483647,outgoing-window:2147483647,handle-max:65535)
> {color:#f6c342}SEND (ch=0) 
> attach(name:link1,handle:0,role:False,source:source(),target:target(address:q1),initial-delivery-count:0)
> SEND (ch=0) detach(handle:0,closed:True)
> SEND (ch=0) 
> attach(name:link0,handle:1,role:False,source:source(),target:target(address:q0),initial-delivery-count:0)
> SEND (ch=0) detach(handle:1,closed:True){color}
> *{color:#707070}RECV (ch=0) 
> attach(name:link1,handle:0,role:True,snd-settle-mode:2,rcv-settle-mode:0,source:source(),target:target(address:q1)){color}*
> RECV (ch=0) 
> flow(next-in-id:4294967293,in-window:2147483647,next-out-id:1,out-window:2147483647,handle:0,delivery-count:0,link-credit:1000)
> *{color:#707070}RECV (ch=0) detach(handle:0,closed:True){color}*
> {color:#f6c342}SEND (ch=0) 
> 
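For anyone comparing traces like the two above, here is a minimal sketch of how the missing responses could be spotted mechanically. It is not part of Artemis or Amqp.Net Lite: the class name `UnansweredAttachSketch` and its method are hypothetical, and it assumes each frame has been rejoined onto a single line (the mail wrapping splits them) with the colour markup stripped.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical trace checker; assumes one frame per line in the format shown in the report. */
final class UnansweredAttachSketch {

    // Captures the direction (SEND/RECV) and the link name of an attach frame.
    private static final Pattern ATTACH =
            Pattern.compile("^(SEND|RECV) \\(ch=\\d+\\) attach\\(name:([^,)]+)");

    /** Returns the names of links the client attached for which no attach was received back. */
    static Set<String> unansweredAttaches(List<String> frames) {
        Set<String> pending = new LinkedHashSet<>();
        for (String frame : frames) {
            Matcher m = ATTACH.matcher(frame);
            if (!m.find()) {
                continue; // not an attach frame (open, begin, detach, flow, empty, ...)
            }
            if (m.group(1).equals("SEND")) {
                pending.add(m.group(2));
            } else {
                pending.remove(m.group(2));
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        // Frames abbreviated from the first trace in the report.
        List<String> frames = Arrays.asList(
                "SEND (ch=0) attach(name:link1,handle:0,role:False)",
                "SEND (ch=0) detach(handle:0,closed:True)",
                "SEND (ch=0) attach(name:link0,handle:1,role:False)",
                "SEND (ch=0) detach(handle:1,closed:True)",
                "SEND (ch=0) empty");
        System.out.println("Unanswered attaches: " + unansweredAttaches(frames));
    }
}
```

The same approach would work for detach frames by matching on `handle:` instead of `name:`.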

[jira] [Commented] (AMQ-7067) KahaDB Recovery can experience a dangling transaction when prepare and commit occur on different data files.

2018-10-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AMQ-7067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658338#comment-16658338
 ] 

ASF GitHub Bot commented on AMQ-7067:
-

Github user jgoodyear closed the pull request at:

https://github.com/apache/activemq/pull/302


> KahaDB Recovery can experience a dangling transaction when prepare and commit 
> occur on different data files.
> 
>
> Key: AMQ-7067
> URL: https://issues.apache.org/jira/browse/AMQ-7067
> Project: ActiveMQ
> Issue Type: Bug
> Components: KahaDB, XA
> Affects Versions: 5.15.6
> Reporter: Jamie goodyear
> Assignee: Gary Tully
> Priority: Critical
> Fix For: 5.16.0, 5.15.7
>
> Attachments: amq7067test.patch
>
>
> KahaDB Recovery can experience a dangling transaction when prepare and commit 
> occur on different data files.
> Scenario:
> An XA transaction is started, a message is prepared and sent into the broker.
> We then send enough messages into the broker to fill the page file (100 
> messages with 512 * 1024 characters in the message payload). This forces a 
> new data file to be created.
> Commit the XA transaction. The commit will land on the new data file.
> Restart the broker.
> Upon restart, a KahaDB recovery is executed.
> The prepare in PageFile 1 is not matched to the commit in PageFile 2; as a 
> result, it will appear in the recovered message state.
> Looking deeper into this scenario, it appears that the commit message is 
> GC'd, hence the prepare and commit cannot be matched.
> The MessageDatabase only checks the following for GC:
> // Don't GC files referenced by in-progress tx
> if (inProgressTxRange[0] != null) {
>     for (int pendingTx = inProgressTxRange[0].getDataFileId();
>          pendingTx <= inProgressTxRange[1].getDataFileId(); pendingTx++) {
>         gcCandidateSet.remove(pendingTx);
>     }
> }
> When GCing files, we need to be aware of which page files the prepare and 
> the commit occur in.
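The following is a simplified model of the range check quoted above, offered as one reading of the report rather than the actual MessageDatabase code. The class and method names are hypothetical, and data files are reduced to plain integer ids. It shows that only ids inside the in-progress range are protected, so a data file holding a commit outside that range remains a GC candidate.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified model of the in-progress range check; not the KahaDB implementation. */
final class GcRangeSketch {

    /** Returns the GC candidates left after protecting every data file id in [firstTxFile, lastTxFile]. */
    static Set<Integer> protectInProgressRange(Set<Integer> gcCandidates, int firstTxFile, int lastTxFile) {
        Set<Integer> remaining = new TreeSet<>(gcCandidates);
        for (int pendingTx = firstTxFile; pendingTx <= lastTxFile; pendingTx++) {
            remaining.remove(pendingTx);
        }
        return remaining;
    }

    public static void main(String[] args) {
        Set<Integer> candidates = new TreeSet<>(Arrays.asList(1, 2, 3));
        // Assumption for illustration: the tracked range covers only data file 1
        // (where the prepare landed), while the commit landed in data file 2.
        Set<Integer> remaining = protectInProgressRange(candidates, 1, 1);
        // Data file 2 is still a candidate, so the commit record could be collected.
        System.out.println("Still eligible for GC: " + remaining);
    }
}
```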



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AMQ-7082) KahaDB index, recover free pages in parallel with start

2018-10-21 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/AMQ-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658304#comment-16658304
 ] 

ASF subversion and git services commented on AMQ-7082:
--

Commit c6103415b9a185776ebc16343c6574f7ff884806 in activemq's branch 
refs/heads/activemq-5.15.x from gtully
[ https://git-wip-us.apache.org/repos/asf?p=activemq.git;h=c610341 ]

AMQ-7082 - recover index free pages in parallel with start, merge in flush, 
clean shutdown if complete. follow up on AMQ-6590


> KahaDB index, recover free pages in parallel with start
> ---
>
> Key: AMQ-7082
> URL: https://issues.apache.org/jira/browse/AMQ-7082
> Project: ActiveMQ
> Issue Type: Bug
> Components: KahaDB
> Affects Versions: 5.15.0
> Reporter: Gary Tully
> Assignee: Gary Tully
> Priority: Major
> Fix For: 5.16.0, 5.15.7
>
>
> AMQ-6590 fixes free page loss through recovery. The recovery process can be 
> time-consuming, which prevents fast failover. Doing recovery on shutdown is 
> preferable, but it is still not ideal because it will hold onto the KahaDB 
> lock. It can also stall shutdown unexpectedly.
> AMQ-7080 is going to tackle checkpointing the free list. This should help 
> avoid the need for recovery, but recovery may still be necessary. If the 
> performance hit is significant, this may need to be optional.
> There will still be the need to walk the index to find the free list.
> It is possible to run with no free list and grow, and we can do that while we 
> recover the free list in parallel, then merge the two at a safe point. This 
> we can do at startup.
> In cases where the disk is the bottleneck this won't help much, but it will 
> help failover and it will help shutdown; with a bit of luck the recovery will 
> complete before we stop.
>
> Initially I thought this would be too complex, but if we concede some growth 
> while we recover, i.e. start with an empty free list, it should be 
> straightforward to merge with a recovered one.
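The idea described above (start with an empty free list, grow the file while recovery runs in parallel, merge at a safe point) can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the class `FreeListMergeSketch`, its fields and its `flush()` method are hypothetical, and the background recovery is represented by a `CompletableFuture` that some other thread completes.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of "run with an empty free list, merge the recovered one later". */
final class FreeListMergeSketch {
    private final SortedSet<Long> freeList = new TreeSet<>();          // starts empty
    private final CompletableFuture<SortedSet<Long>> recovered;        // filled by a background scan
    private long nextPageId;                                           // next page id when growing the file
    private boolean merged;

    FreeListMergeSketch(long existingPageCount, CompletableFuture<SortedSet<Long>> recovered) {
        this.nextPageId = existingPageCount;
        this.recovered = recovered;
    }

    /** Reuses a known free page if there is one, otherwise grows the file by one page. */
    synchronized long allocate() {
        if (!freeList.isEmpty()) {
            Long id = freeList.first();
            freeList.remove(id);
            return id;
        }
        return nextPageId++;
    }

    /** Safe point: merges the recovered free pages once, if the background scan has finished. */
    synchronized boolean flush() {
        if (merged || !recovered.isDone()) {
            return false;
        }
        freeList.addAll(recovered.join());
        merged = true;
        return true;
    }

    public static void main(String[] args) {
        CompletableFuture<SortedSet<Long>> scan = new CompletableFuture<>();
        FreeListMergeSketch index = new FreeListMergeSketch(10, scan);
        System.out.println("allocated while recovering: " + index.allocate());
        scan.complete(new TreeSet<Long>(Arrays.asList(3L, 7L)));   // recovery finishes
        System.out.println("merged on flush: " + index.flush());
        System.out.println("allocated after merge: " + index.allocate());
    }
}
```

Pages allocated by growth have ids beyond the existing page count, so in this model they cannot collide with pages that the scan later reports as free. That is the "concede some growth" trade-off.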





[jira] [Commented] (AMQ-6590) KahaDB index loses track of free pages on unclean shutdown

2018-10-21 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/AMQ-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658305#comment-16658305
 ] 

ASF subversion and git services commented on AMQ-6590:
--

Commit c6103415b9a185776ebc16343c6574f7ff884806 in activemq's branch 
refs/heads/activemq-5.15.x from gtully
[ https://git-wip-us.apache.org/repos/asf?p=activemq.git;h=c610341 ]

AMQ-7082 - recover index free pages in parallel with start, merge in flush, 
clean shutdown if complete. follow up on AMQ-6590


> KahaDB index loses track of free pages on unclean shutdown
> --
>
> Key: AMQ-6590
> URL: https://issues.apache.org/jira/browse/AMQ-6590
> Project: ActiveMQ
> Issue Type: Bug
> Components: Broker
> Affects Versions: 5.14.3
> Reporter: Christopher L. Shannon
> Assignee: Christopher L. Shannon
> Priority: Major
> Fix For: 5.15.0, 5.14.4, 5.16.0, 5.15.7
>
>
> I have discovered an issue with the KahaDB index recovery after an unclean 
> shutdown (OOM error, kill -9, etc) that leads to excessive disk space usage. 
> Normally on clean shutdown the index stores the known set of free pages to 
> db.free and reads that in on start up to know which pages can be re-used.  On 
> an unclean shutdown this is not written to disk so on start up the index is 
> supposed to scan the page file to figure out all of the free pages.
> Unfortunately it turns out that this scan of the page file is being done 
> before the total page count value has been set so when the iterator is 
> created it always thinks there are 0 pages to scan.
> The end result is that every time an unclean shutdown occurs all known free 
> pages are lost and no longer tracked.  This of course means new free pages 
> have to be allocated and all of the existing space is now lost which will 
> lead to excessive index file growth over time.
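The ordering problem described above can be illustrated with a small model. It is a sketch only: `PageScanOrderSketch` and its members are hypothetical stand-ins, with an in-memory boolean array in place of the page file. It shows how a scan bounded by a page count that has not been set yet finds nothing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of scanning for free pages before the total page count is known. */
final class PageScanOrderSketch {
    private final boolean[] pageIsFree;   // stand-in for the free/used marker of each page on disk
    private long pageCount;               // stays 0 until loadPageCount() is called

    PageScanOrderSketch(boolean[] pageIsFree) {
        this.pageIsFree = pageIsFree;
    }

    /** Stand-in for reading the total page count during index load. */
    void loadPageCount() {
        pageCount = pageIsFree.length;
    }

    /** Scans pages [0, pageCount) and returns the ids of the free ones. */
    List<Long> scanFreePages() {
        long limit = pageCount;           // bound is fixed when the scan starts
        List<Long> free = new ArrayList<>();
        for (long id = 0; id < limit; id++) {
            if (pageIsFree[(int) id]) {
                free.add(id);
            }
        }
        return free;
    }

    public static void main(String[] args) {
        PageScanOrderSketch index = new PageScanOrderSketch(new boolean[] {false, true, false, true});
        // Scanning first: the bound is still 0, so every free page is missed.
        System.out.println("scan before page count is set: " + index.scanFreePages());
        index.loadPageCount();
        System.out.println("scan after page count is set: " + index.scanFreePages());
    }
}
```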





[jira] [Resolved] (AMQ-7082) KahaDB index, recover free pages in parallel with start

2018-10-21 Thread Jeff Genender (JIRA)


 [ 
https://issues.apache.org/jira/browse/AMQ-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Genender resolved AMQ-7082.

Resolution: Fixed



[jira] [Commented] (AMQ-7082) KahaDB index, recover free pages in parallel with start

2018-10-21 Thread Jeff Genender (JIRA)


[ 
https://issues.apache.org/jira/browse/AMQ-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658306#comment-16658306
 ] 

Jeff Genender commented on AMQ-7082:


[~gtully], I love the merge...this makes sense and is a quick combination.  
Since the flush() happens in the checkpointUpdate, that makes this thread safe. 
 Nice thinking about merging, that was a nice solution outside the box.

This definitely needs to go into 5.15.7.  I went ahead and did that as this 
closes the loop on any delays from freeList scanning.

 
