Re: Apache ManifoldCF Performance

Karl Wright Fri, 12 Sep 2014 11:37:03 -0700

Hi Paul,

Zookeeper is fine; file system multiprocess, though, was just broken by a
recent windows update, so don't use that.


Karl


On Fri, Sep 12, 2014 at 2:34 PM, Paul Boichat <[email protected]>
wrote:

> Hi,
>
> Will do. Am also putting in some extra debug to help narrow it down.
>
> We're using zookeeper multiprocess.
>
> Paul
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Fri, Sep 12, 2014 at 5:51 PM, Karl Wright <[email protected]> wrote:
>
>> It's done actually; my new laptop with PostgreSQL 9.3 can do 111111
>> documents in roughly 15 minutes.  No problems encountered; roughly 120
>> documents per second.
>>
>> For sanity sake, could you try the following:
>>
>> - check out or unpack 1.6.1 sources
>> - lay down downloaded lib dependencies
>> - build using "ant build"
>> - modify properties.xml and start the approparite example
>>
>> Please see if you have any problems doing this process *without* any
>> patches.
>>
>> Also, what kind of synchronization are you using?  File based, zookeeper,
>> or single-process?
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Sep 12, 2014 at 12:29 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Paul,
>>>
>>> The query looks right; the database driver determines the maximum number
>>> of clauses in a conjunction OR list, just like it does for an IN() list.
>>> In the case of Postgresql and OR, the limit is 25; for IN()'s it's 100.
>>>
>>> The standard integration tests generally run small jobs but that is
>>> typically sufficient to find query generation problems.  I have load tests
>>> I can also run but they take several hours to complete.  I'll start one
>>> now, but I may need to abort it before it finishes.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm looking through the logs - can see the change from IN to OR in each
>>>> query - and there's clearly a difference in execution path but it's quite
>>>> verbose so will take a while.
>>>>
>>>> It may well be that document state has not been reprioritised or in
>>>> some way inconsistent. However, I don't think it's that which is causing
>>>> the issue - I can switch this behaviour on and off over by changing the
>>>> DBInterfacePostgres class and restarting Manifold. That seems to suggest a
>>>> query isn't behaving the same way between IN and OR - I just can't isolate
>>>> the particular query (yet).
>>>>
>>>> Have you tested with a job already in running state (on a restart) with
>>>> a large document count? For example am seeing this kind of thing which
>>>> looks messy but appears to execute as you'd expect:
>>>>
>>>> SELECT
>>>> id,dockey,lastversion,lastoutputversion,authorityname,forcedparams FROM
>>>> ingeststatus WHERE  (dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND
>>>> connectionname=?]
>>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 0:
>>>> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A'
>>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 1:
>>>> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'
>>>>
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> The tests in fact do multiple complete crawls, so it is extremely
>>>>> unlikely that the stuffer query is broken.  If you look at the queries
>>>>> generated, you should note that the only difference is that whenever an 
>>>>> xxx
>>>>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead.
>>>>> These should be completely equivalent; if they don't look equivalent to 
>>>>> you
>>>>> in the log, then I will fix whatever is broken.  I'll make sure here that
>>>>> the queries look right visually too.
>>>>>
>>>>> One possibility is that when you restarted the agents process, the
>>>>> jobqueue records did not yet finish getting reprioritized.  Stuffer 
>>>>> queries
>>>>> are fired all the time, but the running jobs must complete 
>>>>> reprioritization
>>>>> before the stuffer query will pick up any records.  I wonder if they may
>>>>> not have managed to get to the right state before you aborted the
>>>>> experiment?  You can tell what is happening by using jstack to get a 
>>>>> thread
>>>>> dump of the agents process.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I stayed with base 1.6.1 and manually patched the code to include the
>>>>>> two new methods in DBInterfacePostgreSQL
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> The changes pass all tests here.  Is it possible that you attempted
>>>>>>> some upgrade that failed (or didn't attempt upgrade but went to a new 
>>>>>>> code
>>>>>>> version)?
>>>>>>>
>>>>>>> If you could let me know as exactly as possible what you did, I can
>>>>>>> let you know if that should have worked or not.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Karl,
>>>>>>>>
>>>>>>>> We appear to be seeing an issue with the performance change to use
>>>>>>>> an OR clause rather than IN. After making the change, when we restart
>>>>>>>> manifoldcf (with one job in running state) documents in the running 
>>>>>>>> job are
>>>>>>>> not picked up for processing by the stuffer thread. If we redploy base
>>>>>>>> 1.6.1 and restart documents are processed. This is consistently 
>>>>>>>> switchable
>>>>>>>> depending on which code base is deployed.
>>>>>>>>
>>>>>>>> We have logs that I could upload to the ticket if you recommend
>>>>>>>> that we reopen the issue (or create a new one)?
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual to 
>>>>>>>> whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely those of 
>>>>>>>> the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>> you are not the intended recipient of this email, you must neither 
>>>>>>>> take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>> contact the sender if you believe you have received this email in 
>>>>>>>> error.
>>>>>>>>
>>>>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul --
>>>>>>>>>
>>>>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>>>>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x 
>>>>>>>>> release
>>>>>>>>> to 2.0.  That's why I said to use the patches, and try to stay on 
>>>>>>>>> 1.6.1 or
>>>>>>>>> at most to migrate to 1.7.
>>>>>>>>>
>>>>>>>>> IF you ALREADY tried an upgrade with the branch code, then you
>>>>>>>>> would have wound up in a schema state where the schema had more 
>>>>>>>>> columns in
>>>>>>>>> it than the branch knew how to deal with.  That's bad, and you will 
>>>>>>>>> need to
>>>>>>>>> do things to fix the situation.  I believe you should still be able 
>>>>>>>>> to do
>>>>>>>>> the following:
>>>>>>>>>
>>>>>>>>> - Download 1.7 source, or check out
>>>>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>>>>>>> - Apply the patches
>>>>>>>>> - Build
>>>>>>>>> - Modify your properties.xml to point to your postgresql instance
>>>>>>>>> - Run the upgrade (initialize.bat on the multi-process example, or
>>>>>>>>> start the single-process example)
>>>>>>>>>
>>>>>>>>> You should then have a working 1.7 release, with code patches
>>>>>>>>> applied.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks - we've pulled down the branch and will test the changes.
>>>>>>>>>> It looks like a branch of 1.7 so it's going to take us a little 
>>>>>>>>>> while to
>>>>>>>>>> test. We need to migrate our connectors (there's some deprecated 
>>>>>>>>>> stuff
>>>>>>>>>> that's now been cleared in 1.7 .eg. getShareACL) and we'll need to 
>>>>>>>>>> patch
>>>>>>>>>> the database to include the pipeline and any other schema changes. 
>>>>>>>>>> We'll
>>>>>>>>>> have some environment contention over the next week as our 
>>>>>>>>>> performance test
>>>>>>>>>> environment needs to remain on 1.6.1 while we test a release. Once 
>>>>>>>>>> that's
>>>>>>>>>> clear I'll move to 1.7
>>>>>>>>>>
>>>>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there
>>>>>>>>>> a simple way to migrate and existing database?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> VP Engineering,
>>>>>>>>>> Exonar Ltd
>>>>>>>>>>
>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>
>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>
>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>> confidential and are intended solely for the use of the individual 
>>>>>>>>>> to whom
>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of 
>>>>>>>>>> the
>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>> you are not the intended recipient of this email, you must neither 
>>>>>>>>>> take any
>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>> contact the sender if you believe you have received this email in 
>>>>>>>>>> error.
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>>>>>>> CONNECTORS-1027.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>>>>>>
>>>>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue
>>>>>>>>>>>> LIMIT 10001) t;
>>>>>>>>>>>>
>>>>>>>>>>>> QUERY PLAN
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>> ----------------------------------
>>>>>>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>>>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>>>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>>>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>>>>>>>                Heap Fetches: 725
>>>>>>>>>>>>  Total runtime: 9.157 ms
>>>>>>>>>>>> (5 rows)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Whereas:
>>>>>>>>>>>>
>>>>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>>>>>>
>>>>>>>>>>>> QUERY PLAN
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0)
>>>>>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>>>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>>>>>>>                Heap Fetches: 684297
>>>>>>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>
>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>
>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>
>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>> confidential and are intended solely for the use of the individual 
>>>>>>>>>>>> to whom
>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those 
>>>>>>>>>>>> of the
>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>> you are not the intended recipient of this email, you must neither 
>>>>>>>>>>>> take any
>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. 
>>>>>>>>>>>> Please
>>>>>>>>>>>> contact the sender if you believe you have received this email in 
>>>>>>>>>>>> error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you try this query on your database please and tell me
>>>>>>>>>>>>> if it executes promptly:
>>>>>>>>>>>>>
>>>>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I vaguely remember that I had to change the form of this query
>>>>>>>>>>>>> in order to support MySQL -- but first let's see if this helps.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based
>>>>>>>>>>>>>> branch (branches/CONNECTORS-1027) for looking at any changes we 
>>>>>>>>>>>>>> do for
>>>>>>>>>>>>>> large-scale Postgresql optimization work.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please note that trunk code already has schema changes
>>>>>>>>>>>>>> relative to MCF 1.7, so you will not be able to work directly 
>>>>>>>>>>>>>> with this
>>>>>>>>>>>>>> branch code.  I'll have to create patches for whatever changes 
>>>>>>>>>>>>>> you would
>>>>>>>>>>>>>> need to try.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them
>>>>>>>>>>>>>>> to the doc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>> confidential and are intended solely for the use of the 
>>>>>>>>>>>>>>> individual to whom
>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely 
>>>>>>>>>>>>>>> those of the
>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must 
>>>>>>>>>>>>>>> neither take any
>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. 
>>>>>>>>>>>>>>> Please
>>>>>>>>>>>>>>> contact the sender if you believe you have received this email 
>>>>>>>>>>>>>>> in error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>>>>>>>> long-running query (596499 ms): [SELECT 
>>>>>>>>>>>>>>>> t0.id,t0.dochash,t0.docid
>>>>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND 
>>>>>>>>>>>>>>>> t1.parentidhash=? AND
>>>>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> These queries are all from the UI; it is what gets
>>>>>>>>>>>>>>>> generated when no limits are in place:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) -
>>>>>>>>>>>>>>>> Found a long-running query (166845 ms): [SELECT 
>>>>>>>>>>>>>>>> jobid,COUNT(dochash) AS
>>>>>>>>>>>>>>>> doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) -
>>>>>>>>>>>>>>>> Found a long-running query (107222 ms): [SELECT 
>>>>>>>>>>>>>>>> jobid,COUNT(dochash) AS
>>>>>>>>>>>>>>>> doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) -
>>>>>>>>>>>>>>>> Found a long-running query (254851 ms): [SELECT COUNT(dochash) 
>>>>>>>>>>>>>>>> AS doccount
>>>>>>>>>>>>>>>> FROM jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause. 
>>>>>>>>>>>>>>>>  It
>>>>>>>>>>>>>>>> certainly didn't used to do that.  If you have any other 
>>>>>>>>>>>>>>>> suggestions please
>>>>>>>>>>>>>>>> let me know.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>>>>>>> reindexes tables.  For example, this query goes only against 
>>>>>>>>>>>>>>>> the (small)
>>>>>>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to 
>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE 
>>>>>>>>>>>>>>>> status IN
>>>>>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The final query is the document stuffing query, which is
>>>>>>>>>>>>>>>> perhaps the most critical query in the whole system:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  SELECT
>>>>>>>>>>>>>>>>  t0.id
>>>>>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>>>>>>>> t0.checktime
>>>>>>>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>>>>>>>> t1.priority=5
>>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no 
>>>>>>>>>>>>>>>> difference.  What
>>>>>>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have 
>>>>>>>>>>>>>>>> query plans that
>>>>>>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is 
>>>>>>>>>>>>>>>> generated is a
>>>>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to 
>>>>>>>>>>>>>>>> experiment
>>>>>>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the 
>>>>>>>>>>>>>>>>> attached you'll
>>>>>>>>>>>>>>>>> see that it's still taking a long time to return the job 
>>>>>>>>>>>>>>>>> status page. Also
>>>>>>>>>>>>>>>>> in the attached are some sample other long running queries 
>>>>>>>>>>>>>>>>> that we're
>>>>>>>>>>>>>>>>> beginning to see more frequently. There's also an example of 
>>>>>>>>>>>>>>>>> a query that's
>>>>>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a 
>>>>>>>>>>>>>>>>> suggested change
>>>>>>>>>>>>>>>>> to improve performance). This one in particular would 
>>>>>>>>>>>>>>>>> certainly benefit
>>>>>>>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound 
>>>>>>>>>>>>>>>>> bottleneck on
>>>>>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so
>>>>>>>>>>>>>>>>> would be keen to work with you to optimise where possible.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the 
>>>>>>>>>>>>>>>>> individual to whom
>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely 
>>>>>>>>>>>>>>>>> those of the
>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. 
>>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must 
>>>>>>>>>>>>>>>>> neither take any
>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to 
>>>>>>>>>>>>>>>>> anyone. Please
>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this 
>>>>>>>>>>>>>>>>> email in error.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is 
>>>>>>>>>>>>>>>>>> set:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time 
>>>>>>>>>>>>>>>>>> spent doing the
>>>>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop
>>>>>>>>>>>>>>>>>>> it from count(*)-ing but I'll certainly have another look 
>>>>>>>>>>>>>>>>>>> to see if we've
>>>>>>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll
>>>>>>>>>>>>>>>>>>> put together some samples. As an example, on a job that's 
>>>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>> aborting:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found
>>>>>>>>>>>>>>>>>>> a long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Parameter 1: '1407144048075'
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Plan: Update on jobqueue  (cost=18806.08..445770.39 
>>>>>>>>>>>>>>>>>>> rows=764916 width=287)
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Plan:   ->  Bitmap Heap Scan on jobqueue  
>>>>>>>>>>>>>>>>>>> (cost=18806.08..445770.39
>>>>>>>>>>>>>>>>>>> rows=764916 width=287)
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  
>>>>>>>>>>>>>>>>>>> (cost=0.00..18614.85
>>>>>>>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 
>>>>>>>>>>>>>>>>>>> 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Stats: n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration
>>>>>>>>>>>>>>>>>>> number 06439969 at 14 West Mills, Newbury, Berkshire,
>>>>>>>>>>>>>>>>>>> RG14 5HG
>>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the 
>>>>>>>>>>>>>>>>>>> individual to whom
>>>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely 
>>>>>>>>>>>>>>>>>>> those of the
>>>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar 
>>>>>>>>>>>>>>>>>>> Ltd. If
>>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must 
>>>>>>>>>>>>>>>>>>> neither take any
>>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to 
>>>>>>>>>>>>>>>>>>> anyone. Please
>>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this 
>>>>>>>>>>>>>>>>>>> email in error.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a
>>>>>>>>>>>>>>>>>>>> parameter you can set which limits the number of documents 
>>>>>>>>>>>>>>>>>>>> counted to at
>>>>>>>>>>>>>>>>>>>> most a specified amount.  This uses a limit clause, which 
>>>>>>>>>>>>>>>>>>>> should prevent
>>>>>>>>>>>>>>>>>>>> unbounded time doing these kinds of queries:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The 
>>>>>>>>>>>>>>>>>>>> actual true default is
>>>>>>>>>>>>>>>>>>>> 500000.  You could set that lower for better UI 
>>>>>>>>>>>>>>>>>>>> performance (losing some
>>>>>>>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort
>>>>>>>>>>>>>>>>>>>> has been spent in MCF to insure that this doesn't happen.  
>>>>>>>>>>>>>>>>>>>> Specifically,
>>>>>>>>>>>>>>>>>>>> the main document queuing query is structured to read 
>>>>>>>>>>>>>>>>>>>> directly out of a
>>>>>>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that 
>>>>>>>>>>>>>>>>>>>> must work properly
>>>>>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively 
>>>>>>>>>>>>>>>>>>>> just a sort on the
>>>>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some 
>>>>>>>>>>>>>>>>>>>> times where
>>>>>>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, 
>>>>>>>>>>>>>>>>>>>> mostly because it
>>>>>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of 
>>>>>>>>>>>>>>>>>>>> something or one
>>>>>>>>>>>>>>>>>>>> of something, but you can work around that particular 
>>>>>>>>>>>>>>>>>>>> issue by setting the
>>>>>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- 
>>>>>>>>>>>>>>>>>>>> which basically
>>>>>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every 
>>>>>>>>>>>>>>>>>>>> stuffing query.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running
>>>>>>>>>>>>>>>>>>>> in your case so that I can see if that is what you are 
>>>>>>>>>>>>>>>>>>>> encountering or not.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count >
>>>>>>>>>>>>>>>>>>>>> 10 million. At that point, even with good postgres
>>>>>>>>>>>>>>>>>>>>> vacuuming the jobqueue table is starting to become a
>>>>>>>>>>>>>>>>>>>>> bottleneck.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>>>>>>>> executed when querying job status will do a full table 
>>>>>>>>>>>>>>>>>>>>> scan of
>>>>>>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's
>>>>>>>>>>>>>>>>>>>>> going to take some time in postgres.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound 
>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll 
>>>>>>>>>>>>>>>>>>>>> need to refactor
>>>>>>>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and,
>>>>>>>>>>>>>>>>>>>>> potentially partition jobqueue into a subset of
>>>>>>>>>>>>>>>>>>>>> tables (table per queue for example).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration
>>>>>>>>>>>>>>>>>>>>> number 06439969 at 14 West Mills, Newbury, Berkshire,
>>>>>>>>>>>>>>>>>>>>> RG14 5HG
>>>>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may
>>>>>>>>>>>>>>>>>>>>> be confidential and are intended solely for the use of 
>>>>>>>>>>>>>>>>>>>>> the individual to
>>>>>>>>>>>>>>>>>>>>> whom it is addressed. Any views or opinions expressed are 
>>>>>>>>>>>>>>>>>>>>> solely those of
>>>>>>>>>>>>>>>>>>>>> the author and do not necessarily represent those of 
>>>>>>>>>>>>>>>>>>>>> Exonar Ltd. If
>>>>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you 
>>>>>>>>>>>>>>>>>>>>> must neither take any
>>>>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to 
>>>>>>>>>>>>>>>>>>>>> anyone. Please
>>>>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this 
>>>>>>>>>>>>>>>>>>>>> email in error.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database 
>>>>>>>>>>>>>>>>>>>>>> performance is the
>>>>>>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe
>>>>>>>>>>>>>>>>>>>>>> Chapter 2) that discusses this issue.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million
>>>>>>>>>>>>>>>>>>>>>> web documents, using Postgresql 8.3.  Postgresql 9.x is 
>>>>>>>>>>>>>>>>>>>>>> faster, and with
>>>>>>>>>>>>>>>>>>>>>> modern SSD's, I expect that you will do even better.  In 
>>>>>>>>>>>>>>>>>>>>>> general, I'd say
>>>>>>>>>>>>>>>>>>>>>> it was fine to shoot for 10M - 100M documents on 
>>>>>>>>>>>>>>>>>>>>>> ManifoldCF, provided that
>>>>>>>>>>>>>>>>>>>>>> you use a good database, and provided that you maintain 
>>>>>>>>>>>>>>>>>>>>>> it properly.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and 
>>>>>>>>>>>>>>>>>>>>>>> with how many
>>>>>>>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Reply via email to