Oh geez… seriously though, can you give me an estimate? Assuming 30 seconds each, that works out to about 11 hours.
http://activemq.apache.org/junit-reports.html
https://hudson.apache.org/hudson/job/ActiveMQ/

Hudson is down for ActiveMQ. Also, is there a disk space requirement for the tests? I have 10GB or so free and I'm getting TONS of these:

    java.lang.RuntimeException: Failed to start provided job scheduler store: JobSchedulerStore:activemq-data
        at java.io.RandomAccessFile.seek(Native Method)
        at org.apache.activemq.util.RecoverableRandomAccessFile.seek(RecoverableRandomAccessFile.java:384)
        at org.apache.activemq.store.kahadb.disk.page.PageFile.readPage(PageFile.java:877)
        at org.apache.activemq.store.kahadb.disk.page.Transaction.load(Transaction.java:427)
        at org.apache.activemq.store.kahadb.disk.page.Transaction.load(Transaction.java:377)
        at org.apache.activemq.store.kahadb.disk.index.BTreeIndex.load(BTreeIndex.java:159)
        at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl$MetaData.load(JobSchedulerStoreImpl.java:90)
        at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl$3.execute(JobSchedulerStoreImpl.java:277)
        at org.apache.activemq.store.kahadb.disk.page.Transaction.execute(Transaction.java:779)
        at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl.doStart(JobSchedulerStoreImpl.java:261)
        at org.apache.activemq.util.ServiceSupport.start(ServiceSupport.java:55)
        at org.apache.activemq.broker.BrokerService.setJobSchedulerStore(BrokerService.java:1891)
        at org.apache.activemq.transport.stomp.StompTestSupport.createBroker(StompTestSupport.java:174)
        at org.apache.activemq.transport.stomp.StompTestSupport.startBroker(StompTestSupport.java:112)
        at org.apache.activemq.transport.stomp.StompTestSupport.setUp(StompTestSupport.java:94)
        at org.apache.activemq.transport.stomp.Stomp12Test.setUp(Stomp12Test.java:41)
        at org.apache.activemq.transport.stomp.Stomp12SslAuthTest.setUp(Stomp12SslAuthTest.java:38)

On Tue, Feb 24, 2015 at 10:09 AM, Timothy Bish <[email protected]> wrote:

On 02/24/2015 01:01 PM, Kevin Burton wrote:
> How long do the tests usually take?

An eternity + 1hr

> I was about 45 minutes in before I gave up… I think part of it was that the box I was testing on was virtualized and didn't have enough resources.
>
> I tried to parallelize the tests (-T 8 with maven) but I got other errors, so I assume the ports are singletons.

They won't be happy if you run them in parallel.

--
Tim Bish
Sr Software Engineer | RedHat Inc.
[email protected] | www.redhat.com
skype: tabish121 | twitter: @tabish121
blog: http://timbish.blogspot.com/

On Tue, Feb 24, 2015 at 8:03 AM, Gary Tully <[email protected]> wrote:

If there are any test failures, try to run them individually with -Dtest=a,b etc. There may be an issue with a full test run, but all of the tests that are enabled should work. I know there are some issues with jdbc tests that hang or fail due to previous runs not cleaning up, but that should be the most of it. I got a bunch of full test runs before the 5.11 release, if that is any help.

On 23 February 2015 at 20:38, Kevin Burton <[email protected]> wrote:

OK. This is ready to go and I have a patch branch:

https://issues.apache.org/jira/browse/AMQ-5609

I'm stuck at the moment though because the tests don't pass. But they were failing before my changes, so I don't think it has anything to do with them.
On Sun, Feb 22, 2015 at 11:11 PM, Kevin Burton <[email protected]> wrote:

Actually, is the lock even needed here? Why would it be? If we're *removing* a subscription, why does it matter if we possibly ALSO remove a separate / isolated queue before or after the subscription is removed?

I think this is redundant and can be removed. Maybe I'm wrong though.

I looked at all the callers and none were associated with queues.

On Sun, Feb 22, 2015 at 11:07 PM, Kevin Burton <[email protected]> wrote:

So I have some working/theoretical code that should resolve this.

It acquires a lock *per* ActiveMQDestination; this way there is no lock contention.

But here's where I'm stuck:

    @Override
    public void removeSubscription(ConnectionContext context,
                                   RemoveSubscriptionInfo info) throws Exception {
        inactiveDestinationsPurgeLock.readLock().lock();
        try {
            topicRegion.removeSubscription(context, info);
        } finally {
            inactiveDestinationsPurgeLock.readLock().unlock();
        }
    }

... this is in RegionBroker. There is no ActiveMQDestination involved here, so I'm not sure of the best way to resolve this.

Any advice?

On Sun, Feb 22, 2015 at 8:11 PM, Kevin Burton <[email protected]> wrote:

Yes. That was my thinking too... that just replacing the CopyOnWriteArraySet with something more performant would solve the issue.

This would also improve queue creation time as well as queue deletion time.

What I think I'm going to do in the meantime is:

- implement a granular lock based on queue name… I am going to use an interface so we can replace the implementation later.

- implement timing for the purge thread so it tracks how long it takes to remove a queue, but also how long the entire loop takes.

I'll do this on a branch so it should be easy to merge.

On Sun, Feb 22, 2015 at 7:40 PM, Tim Bain <[email protected]> wrote:

A decent amount of the time is being spent calling remove() on various array-backed collections. Those data structures might be inappropriate for the number of destinations you're running, since array-backed collections tend to have add/remove operations that are O(N); some improvement might come from something as simple as moving to a ConcurrentHashSet instead of a CopyOnWriteArraySet, for example. (Or it might make performance worse because of other aspects of how those collections are used; people other than me would be in a better position to evaluate the full range of performance requirements for those collections.)

Scheduler.cancel() also takes an alarming amount of time for what looks like a really simple method (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-all/5.10.0/org/apache/activemq/thread/Scheduler.java#Scheduler.cancel%28java.lang.Runnable%29).
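On the collection swap Tim suggests: the JDK doesn't ship a class literally named ConcurrentHashSet, but a concurrent set view over a ConcurrentHashMap gives roughly O(1) add/remove. A minimal, illustrative sketch of the trade-off (class and field names are made up, not the actual fields in the region code):

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArraySet;

    // Illustrative comparison only; not the actual broker fields.
    public class SubscriptionSetSketch {

        // Current style: every add/remove copies the backing array (O(n)),
        // but iteration works on a cheap, stable snapshot.
        private final Set<String> copyOnWrite = new CopyOnWriteArraySet<String>();

        // Possible replacement: expected O(1) add/remove, weakly consistent
        // iteration. (ConcurrentHashMap.newKeySet() is the Java 8+ shorthand.)
        private final Set<String> concurrent =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        public void churn(String subscriptionId) {
            copyOnWrite.add(subscriptionId);    // copies the array
            copyOnWrite.remove(subscriptionId); // copies it again

            concurrent.add(subscriptionId);     // no copy
            concurrent.remove(subscriptionId);  // no copy
        }
    }

As Tim notes, the catch is iteration semantics: code that relies on the copy-on-write snapshot while dispatching could behave differently with a weakly consistent set, so the swap would need to be checked call site by call site.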
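Kevin's "granular lock based on queue name" could be as simple as a map of locks keyed by the destination's physical name. This is only a sketch of the idea, not the code on the AMQ-5609 branch; DestinationLocks is a hypothetical helper, and a broker-level call such as removeSubscription() (where no ActiveMQDestination is in scope) would still need the global inactiveDestinationsPurgeLock as a fallback.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical helper: one ReadWriteLock per destination, so purging one
    // inactive destination does not contend with work on the others.
    public class DestinationLocks {

        private final ConcurrentMap<String, ReadWriteLock> locks =
                new ConcurrentHashMap<String, ReadWriteLock>();

        // Get (or lazily create) the lock for a destination's physical name.
        public ReadWriteLock forDestination(String physicalName) {
            ReadWriteLock lock = locks.get(physicalName);
            if (lock == null) {
                ReadWriteLock candidate = new ReentrantReadWriteLock();
                ReadWriteLock existing = locks.putIfAbsent(physicalName, candidate);
                lock = (existing != null) ? existing : candidate;
            }
            return lock;
        }

        // Drop the entry once the destination is gone, so the map does not
        // grow without bound as queues are created and deleted.
        public void destinationRemoved(String physicalName) {
            locks.remove(physicalName);
        }
    }

The purge thread would take forDestination(dest.getPhysicalName()).writeLock() around the removal of that one destination, while producers and consumers touching it take the read lock; hiding this behind an interface, as Kevin plans, leaves room to swap in a striped-lock implementation later.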
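The timing Kevin wants for the purge thread needs nothing more than System.nanoTime() around each removal and around the whole pass. A self-contained sketch with stand-in names (DestinationRemover is not an ActiveMQ type):

    import java.util.List;
    import java.util.concurrent.TimeUnit;

    public final class PurgeTimingSketch {

        // Stand-in for whatever actually removes a destination
        // (e.g. a call into the broker's removeDestination path).
        public interface DestinationRemover {
            void remove(String destinationName) throws Exception;
        }

        // Times each removal and the whole purge pass.
        public static void timedPurge(List<String> candidates, DestinationRemover remover) throws Exception {
            long passStart = System.nanoTime();
            for (String name : candidates) {
                long start = System.nanoTime();
                remover.remove(name);
                long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                System.out.printf("purged %s in %d ms%n", name, ms);
            }
            long passMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - passStart);
            System.out.printf("purge pass over %d destinations took %d ms%n", candidates.size(), passMs);
        }
    }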
On Sun, Feb 22, 2015 at 7:56 PM, Kevin Burton <[email protected]> wrote:

Here's a jprofiler view with the advisory support enabled, if you're curious:

http://i.imgur.com/I1jesZz.jpg

I'm not familiar enough with the internals of ActiveMQ to have any obvious optimization ideas.

One other idea I had (which would require a ton of refactoring, I think) would be to bulk delete all the queues at once.

On Sun, Feb 22, 2015 at 6:42 PM, Kevin Burton <[email protected]> wrote:

Spending some more time in jprofiler, it looks like 20% of this is due to schedulerSupport and the other 80% is due to advisorySupport. If I set both to false, the total runtime of my tests drops in half… and the latencies fall from

    max create producer latency: 10,176 ms
    max create message on existing producer and consumer: 2 ms

… to

    max create producer latency: 1 ms
    max create message on existing producer and consumer: 1 ms

and this is without even fixing the purge background lock.

So the question now is what the heck the advisory support is doing that can result in such massive performance overhead.

… and I think advisorySupport is enabled by default, so that's problematic as well.
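For anyone reproducing the measurement above: both flags are plain BrokerService properties. A minimal embedded-broker sketch (the broker name and connector URI are just examples):

    import org.apache.activemq.broker.BrokerService;

    public class NonPersistentBrokerSketch {
        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();
            broker.setBrokerName("perf-test");     // example name
            broker.setPersistent(false);           // matches the production setup in this thread
            broker.setAdvisorySupport(false);      // no advisory topics on destination add/remove
            broker.setSchedulerSupport(false);     // no job scheduler store
            broker.addConnector("tcp://localhost:61616");
            broker.start();
            // ... run the test load against tcp://localhost:61616 ...
            broker.stop();
        }
    }

The caveat is the one already raised in the thread: advisory messages are what some features and tools depend on (networks of brokers use them by default, and monitoring tools watch them), so disabling them broker-wide is a trade-off rather than a free win.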
On Sun, Feb 22, 2015 at 4:45 PM, Kevin Burton <[email protected]> wrote:

OK. Loaded up JProfiler and confirmed that it's not LevelDB.

This is a non-persistent broker I'm testing on.

Looks like it's spending all its time in CopyOnWriteArrayList.remove and Timer.purge…

Which is hopeful, because this is ALL due to ActiveMQ internals, and in theory LevelDB should perform well if we improve the performance of the ActiveMQ internals and fix this lock bug.

Which would rock!

It should ALSO make queue creation faster.

On Sun, Feb 22, 2015 at 4:10 PM, Kevin Burton <[email protected]> wrote:

On Sun, Feb 22, 2015 at 3:30 PM, Tim Bain <[email protected]> wrote:
> So if LevelDB cleanup during removeDestination() is the presumed culprit, can we spin off the LevelDB cleanup work into a separate thread (better: a task object to be serviced by a ThreadPool, so you can avoid a fork bomb if we remove many destinations at once) so the call to removeDestination() can return quickly and LevelDB can do its record-keeping in the background without blocking message-processing?

Would that be possible? If the delete is pending on ActiveMQ there is a race where a producer could re-create it unless the lock is held. Though I guess if you dispatched to the GC thread WITH the lock still held you would be OK, but I think if we use the existing purge thread then we're fine.

OK. I think I'm wrong about LevelDB being the issue. To be fair, I wasn't 100% certain before, but I should have specified.

Our current production broker is running with persistent=false… and I just re-ran the tests without disk persistence and it has the same problem.

So the main issue now is why the heck ActiveMQ is taking SO LONG to GC a queue. It's taking about 100ms, which is an insane amount of time considering this is all done in memory.

Kevin

--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile <https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
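Tim's "task object serviced by a ThreadPool" idea, combined with Kevin's point that the lock has to guard against a producer re-creating the destination, might look roughly like the sketch below. This is not the existing purge thread; the names are illustrative, and re-checking inactivity under the write lock is just one way to close the race Kevin describes.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class BackgroundPurgeSketch {

        // Stand-in for the broker/persistence calls involved in a removal.
        public interface DestinationStore {
            boolean isInactive(String destinationName);
            void remove(String destinationName);
        }

        // A single worker avoids the fork bomb of one thread per destination.
        private final ExecutorService purgePool = Executors.newSingleThreadExecutor();
        private final ReentrantReadWriteLock purgeLock = new ReentrantReadWriteLock();

        public void scheduleRemoval(final String destinationName, final DestinationStore store) {
            purgePool.submit(new Runnable() {
                @Override
                public void run() {
                    purgeLock.writeLock().lock();
                    try {
                        // Re-check under the lock: a producer may have re-created
                        // or touched the destination while the task was queued.
                        if (store.isInactive(destinationName)) {
                            store.remove(destinationName);
                        }
                    } finally {
                        purgeLock.writeLock().unlock();
                    }
                }
            });
        }
    }

removeDestination()-style callers return as soon as the task is queued, and producers and consumers would take the matching read lock, which is the part that has to line up with how inactiveDestinationsPurgeLock is already used.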
