Re: Apache ManifoldCF Performance

Karl Wright Fri, 12 Sep 2014 09:30:19 -0700

Hi Paul,

The query looks right; the database driver determines the maximum number of
clauses in a conjunction OR list, just like it does for an IN() list.  In
the case of Postgresql and OR, the limit is 25; for IN()'s it's 100.


The standard integration tests generally run small jobs but that is
typically sufficient to find query generation problems.  I have load tests
I can also run but they take several hours to complete.  I'll start one
now, but I may need to abort it before it finishes.

Karl


On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <[email protected]>
wrote:

> Hi,
>
> I'm looking through the logs - can see the change from IN to OR in each
> query - and there's clearly a difference in execution path but it's quite
> verbose so will take a while.
>
> It may well be that document state has not been reprioritised or in some
> way inconsistent. However, I don't think it's that which is causing the
> issue - I can switch this behaviour on and off over by changing the
> DBInterfacePostgres class and restarting Manifold. That seems to suggest a
> query isn't behaving the same way between IN and OR - I just can't isolate
> the particular query (yet).
>
> Have you tested with a job already in running state (on a restart) with a
> large document count? For example am seeing this kind of thing which looks
> messy but appears to execute as you'd expect:
>
> SELECT id,dockey,lastversion,lastoutputversion,authorityname,forcedparams
> FROM ingeststatus WHERE  (dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND
> connectionname=?]
> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 0:
> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A'
> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 1:
> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'
>
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Paul,
>>
>> The tests in fact do multiple complete crawls, so it is extremely
>> unlikely that the stuffer query is broken.  If you look at the queries
>> generated, you should note that the only difference is that whenever an xxx
>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead.
>> These should be completely equivalent; if they don't look equivalent to you
>> in the log, then I will fix whatever is broken.  I'll make sure here that
>> the queries look right visually too.
>>
>> One possibility is that when you restarted the agents process, the
>> jobqueue records did not yet finish getting reprioritized.  Stuffer queries
>> are fired all the time, but the running jobs must complete reprioritization
>> before the stuffer query will pick up any records.  I wonder if they may
>> not have managed to get to the right state before you aborted the
>> experiment?  You can tell what is happening by using jstack to get a thread
>> dump of the agents process.
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <[email protected]>
>> wrote:
>>
>>> I stayed with base 1.6.1 and manually patched the code to include the
>>> two new methods in DBInterfacePostgreSQL
>>>
>>> Paul
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> The changes pass all tests here.  Is it possible that you attempted
>>>> some upgrade that failed (or didn't attempt upgrade but went to a new code
>>>> version)?
>>>>
>>>> If you could let me know as exactly as possible what you did, I can let
>>>> you know if that should have worked or not.
>>>>
>>>> Thanks!
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <[email protected]
>>>> > wrote:
>>>>
>>>>> Karl,
>>>>>
>>>>> We appear to be seeing an issue with the performance change to use an
>>>>> OR clause rather than IN. After making the change, when we restart
>>>>> manifoldcf (with one job in running state) documents in the running job 
>>>>> are
>>>>> not picked up for processing by the stuffer thread. If we redploy base
>>>>> 1.6.1 and restart documents are processed. This is consistently switchable
>>>>> depending on which code base is deployed.
>>>>>
>>>>> We have logs that I could upload to the ticket if you recommend that
>>>>> we reopen the issue (or create a new one)?
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Paul --
>>>>>>
>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x release
>>>>>> to 2.0.  That's why I said to use the patches, and try to stay on 1.6.1 
>>>>>> or
>>>>>> at most to migrate to 1.7.
>>>>>>
>>>>>> IF you ALREADY tried an upgrade with the branch code, then you would
>>>>>> have wound up in a schema state where the schema had more columns in it
>>>>>> than the branch knew how to deal with.  That's bad, and you will need to 
>>>>>> do
>>>>>> things to fix the situation.  I believe you should still be able to do 
>>>>>> the
>>>>>> following:
>>>>>>
>>>>>> - Download 1.7 source, or check out
>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>>>> - Apply the patches
>>>>>> - Build
>>>>>> - Modify your properties.xml to point to your postgresql instance
>>>>>> - Run the upgrade (initialize.bat on the multi-process example, or
>>>>>> start the single-process example)
>>>>>>
>>>>>> You should then have a working 1.7 release, with code patches applied.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thanks - we've pulled down the branch and will test the changes. It
>>>>>>> looks like a branch of 1.7 so it's going to take us a little while to 
>>>>>>> test.
>>>>>>> We need to migrate our connectors (there's some deprecated stuff that's 
>>>>>>> now
>>>>>>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the 
>>>>>>> database
>>>>>>> to include the pipeline and any other schema changes. We'll have some
>>>>>>> environment contention over the next week as our performance test
>>>>>>> environment needs to remain on 1.6.1 while we test a release. Once 
>>>>>>> that's
>>>>>>> clear I'll move to 1.7
>>>>>>>
>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a
>>>>>>> simple way to migrate and existing database?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to 
>>>>>>> whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>>>> CONNECTORS-1027.
>>>>>>>>
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>>>
>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue
>>>>>>>>> LIMIT 10001) t;
>>>>>>>>>
>>>>>>>>> QUERY PLAN
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>> ----------------------------------
>>>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>>>>                Heap Fetches: 725
>>>>>>>>>  Total runtime: 9.157 ms
>>>>>>>>> (5 rows)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Whereas:
>>>>>>>>>
>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>>>
>>>>>>>>> QUERY PLAN
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>> ----------------------------------------
>>>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0)
>>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>>>>                Heap Fetches: 684297
>>>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to 
>>>>>>>>> whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of 
>>>>>>>>> the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither 
>>>>>>>>> take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in 
>>>>>>>>> error.
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> Could you try this query on your database please and tell me if
>>>>>>>>>> it executes promptly:
>>>>>>>>>>
>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I vaguely remember that I had to change the form of this query in
>>>>>>>>>> order to support MySQL -- but first let's see if this helps.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>>>>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for 
>>>>>>>>>>> large-scale
>>>>>>>>>>> Postgresql optimization work.
>>>>>>>>>>>
>>>>>>>>>>> Please note that trunk code already has schema changes relative
>>>>>>>>>>> to MCF 1.7, so you will not be able to work directly with this 
>>>>>>>>>>> branch
>>>>>>>>>>> code.  I'll have to create patches for whatever changes you would 
>>>>>>>>>>> need to
>>>>>>>>>>> try.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to
>>>>>>>>>>>> the doc.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>
>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>
>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>
>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>> confidential and are intended solely for the use of the individual 
>>>>>>>>>>>> to whom
>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those 
>>>>>>>>>>>> of the
>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>> you are not the intended recipient of this email, you must neither 
>>>>>>>>>>>> take any
>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. 
>>>>>>>>>>>> Please
>>>>>>>>>>>> contact the sender if you believe you have received this email in 
>>>>>>>>>>>> error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>>>
>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND 
>>>>>>>>>>>>> t1.parentidhash=? AND
>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> These queries are all from the UI; it is what gets generated
>>>>>>>>>>>>> when no limits are in place:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS 
>>>>>>>>>>>>> doccount
>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS 
>>>>>>>>>>>>> doccount
>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>
>>>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>>>
>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS 
>>>>>>>>>>>>> doccount FROM
>>>>>>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>>>
>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>>>>>> certainly didn't used to do that.  If you have any other 
>>>>>>>>>>>>> suggestions please
>>>>>>>>>>>>> let me know.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>>>> reindexes tables.  For example, this query goes only against the 
>>>>>>>>>>>>> (small)
>>>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to 
>>>>>>>>>>>>> something
>>>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status 
>>>>>>>>>>>>> IN
>>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The final query is the document stuffing query, which is
>>>>>>>>>>>>> perhaps the most critical query in the whole system:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  SELECT
>>>>>>>>>>>>>  t0.id
>>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>>>>> t0.checktime
>>>>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>>>>> t1.priority=5
>>>>>>>>>>>>>  )
>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>>>>  )
>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>>>  )
>>>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>>>
>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  
>>>>>>>>>>>>> What
>>>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query 
>>>>>>>>>>>>> plans that
>>>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is 
>>>>>>>>>>>>> generated is a
>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to 
>>>>>>>>>>>>> experiment
>>>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the 
>>>>>>>>>>>>>> attached you'll
>>>>>>>>>>>>>> see that it's still taking a long time to return the job status 
>>>>>>>>>>>>>> page. Also
>>>>>>>>>>>>>> in the attached are some sample other long running queries that 
>>>>>>>>>>>>>> we're
>>>>>>>>>>>>>> beginning to see more frequently. There's also an example of a 
>>>>>>>>>>>>>> query that's
>>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a 
>>>>>>>>>>>>>> suggested change
>>>>>>>>>>>>>> to improve performance). This one in particular would certainly 
>>>>>>>>>>>>>> benefit
>>>>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound 
>>>>>>>>>>>>>> bottleneck on
>>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so would
>>>>>>>>>>>>>> be keen to work with you to optimise where possible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>> confidential and are intended solely for the use of the 
>>>>>>>>>>>>>> individual to whom
>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely 
>>>>>>>>>>>>>> those of the
>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>> you are not the intended recipient of this email, you must 
>>>>>>>>>>>>>> neither take any
>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. 
>>>>>>>>>>>>>> Please
>>>>>>>>>>>>>> contact the sender if you believe you have received this email 
>>>>>>>>>>>>>> in error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is 
>>>>>>>>>>>>>>> set:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time 
>>>>>>>>>>>>>>> spent doing the
>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it
>>>>>>>>>>>>>>>> from count(*)-ing but I'll certainly have another look to see 
>>>>>>>>>>>>>>>> if we've
>>>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>>>>>>>> together some samples. As an example, on a job that's 
>>>>>>>>>>>>>>>> currently aborting:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>> Parameter 1: '1407144048075'
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>>>> Update on jobqueue  (cost=18806.08..445770.39 rows=764916 
>>>>>>>>>>>>>>>> width=287)
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>>>> ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 
>>>>>>>>>>>>>>>> rows=764916
>>>>>>>>>>>>>>>> width=287)
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  
>>>>>>>>>>>>>>>> (cost=0.00..18614.85
>>>>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the 
>>>>>>>>>>>>>>>> individual to whom
>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely 
>>>>>>>>>>>>>>>> those of the
>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must 
>>>>>>>>>>>>>>>> neither take any
>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. 
>>>>>>>>>>>>>>>> Please
>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email 
>>>>>>>>>>>>>>>> in error.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter
>>>>>>>>>>>>>>>>> you can set which limits the number of documents counted to 
>>>>>>>>>>>>>>>>> at most a
>>>>>>>>>>>>>>>>> specified amount.  This uses a limit clause, which should 
>>>>>>>>>>>>>>>>> prevent unbounded
>>>>>>>>>>>>>>>>> time doing these kinds of queries:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual 
>>>>>>>>>>>>>>>>> true default is
>>>>>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance 
>>>>>>>>>>>>>>>>> (losing some
>>>>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has
>>>>>>>>>>>>>>>>> been spent in MCF to insure that this doesn't happen.  
>>>>>>>>>>>>>>>>> Specifically, the
>>>>>>>>>>>>>>>>> main document queuing query is structured to read directly 
>>>>>>>>>>>>>>>>> out of a
>>>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that must 
>>>>>>>>>>>>>>>>> work properly
>>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively just 
>>>>>>>>>>>>>>>>> a sort on the
>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some 
>>>>>>>>>>>>>>>>> times where
>>>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, 
>>>>>>>>>>>>>>>>> mostly because it
>>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of 
>>>>>>>>>>>>>>>>> something or one
>>>>>>>>>>>>>>>>> of something, but you can work around that particular issue 
>>>>>>>>>>>>>>>>> by setting the
>>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which 
>>>>>>>>>>>>>>>>> basically
>>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every 
>>>>>>>>>>>>>>>>> stuffing query.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in
>>>>>>>>>>>>>>>>> your case so that I can see if that is what you are 
>>>>>>>>>>>>>>>>> encountering or not.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>>>>>>> million. At that point, even with good postgres
>>>>>>>>>>>>>>>>>> vacuuming the jobqueue table is starting to become a
>>>>>>>>>>>>>>>>>> bottleneck.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>>>>> executed when querying job status will do a full table scan 
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's
>>>>>>>>>>>>>>>>>> going to take some time in postgres.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll need 
>>>>>>>>>>>>>>>>>> to refactor
>>>>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per
>>>>>>>>>>>>>>>>>> queue for example).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the 
>>>>>>>>>>>>>>>>>> individual to whom
>>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely 
>>>>>>>>>>>>>>>>>> those of the
>>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. 
>>>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must 
>>>>>>>>>>>>>>>>>> neither take any
>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to 
>>>>>>>>>>>>>>>>>> anyone. Please
>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this 
>>>>>>>>>>>>>>>>>> email in error.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database 
>>>>>>>>>>>>>>>>>>> performance is the
>>>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter
>>>>>>>>>>>>>>>>>>> 2) that discusses this issue.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million
>>>>>>>>>>>>>>>>>>> web documents, using Postgresql 8.3.  Postgresql 9.x is 
>>>>>>>>>>>>>>>>>>> faster, and with
>>>>>>>>>>>>>>>>>>> modern SSD's, I expect that you will do even better.  In 
>>>>>>>>>>>>>>>>>>> general, I'd say
>>>>>>>>>>>>>>>>>>> it was fine to shoot for 10M - 100M documents on 
>>>>>>>>>>>>>>>>>>> ManifoldCF, provided that
>>>>>>>>>>>>>>>>>>> you use a good database, and provided that you maintain it 
>>>>>>>>>>>>>>>>>>> properly.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and 
>>>>>>>>>>>>>>>>>>>> with how many
>>>>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Reply via email to