Re: Apache ManifoldCF Performance

Karl Wright Thu, 11 Sep 2014 05:29:01 -0700

Thanks -- I'll include that change as well then, in ticket CONNECTORS-1027.



Karl


On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <[email protected]>
wrote:

> Hi,
>
> That comes back immediately with 10001 rows:
>
> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
> 10001) t;
>
> QUERY PLAN
>
>
> -----------------------------------------------------------------------------------------------------------------------
> ----------------------------------
>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual time=9.125..9.125
> rows=1 loops=1)
>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
> time=0.033..6.945 rows=10001 loops=1)
>          ->  Index Only Scan using jobqueue_pkey on jobqueue
> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
> =0.031..3.257 rows=10001 loops=1)
>                Heap Fetches: 725
>  Total runtime: 9.157 ms
> (5 rows)
>
>
> Whereas:
>
> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>
> QUERY PLAN
>
>
> -----------------------------------------------------------------------------------------------------------------------
> ----------------------------------------
>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
> time=5225.107..5225.109 rows=1 loops=1)
>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
> time=5225.105..5225.106 rows=1 loops=1)
>          ->  Index Only Scan using jobqueue_pkey on jobqueue
> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
> =0.108..3090.848 rows=10370209 loops=1)
>                Heap Fetches: 684297
>  Total runtime: 5225.151 ms
>
> Thanks,
>
> Paul
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Paul,
>>
>> Could you try this query on your database please and tell me if it
>> executes promptly:
>>
>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>
>>
>> I vaguely remember that I had to change the form of this query in order
>> to support MySQL -- but first let's see if this helps.
>>
>> Karl
>>
>>
>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <[email protected]> wrote:
>>
>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>> Postgresql optimization work.
>>>
>>> Please note that trunk code already has schema changes relative to MCF
>>> 1.7, so you will not be able to work directly with this branch code.  I'll
>>> have to create patches for whatever changes you would need to try.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We're on Postgres 9.2. I'll get the query plans and add them to the doc.
>>>>
>>>> Thanks
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Can you include the logged plan for this query; this is an actual
>>>>> query encountered during crawling:
>>>>>
>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>
>>>>>
>>>>> These queries are all from the UI; it is what gets generated when no
>>>>> limits are in place:
>>>>>
>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>
>>>>> This query is from the UI with a limit of 1000000:
>>>>>
>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>> jobqueue t1 LIMIT 1000001]
>>>>>
>>>>> I honestly don't understand why PostgreSQL would execute a sequential
>>>>> scan of the entire table when given a limit clause.  It certainly didn't
>>>>> used to do that.  If you have any other suggestions please let me know.
>>>>>
>>>>>
>>>>>
>>>>> Some queries show up in this list because MCF periodically reindexes
>>>>> tables.  For example, this query goes only against the (small) jobs table.
>>>>> Its poor performance on occasion is likely due to something else happening
>>>>> to the database, probably a reindex:
>>>>>
>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running
>>>>> query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR
>>>>> UPDATE]
>>>>>
>>>>>
>>>>> The final query is the document stuffing query, which is perhaps the
>>>>> most critical query in the whole system:
>>>>>
>>>>>  SELECT
>>>>>  t0.id
>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>  t0.priorityset FROM jobqueue t0
>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND t0.checktime
>>>>>  <= 1407246846166
>>>>>  AND EXISTS (
>>>>>    SELECT 'x' FROM jobs t1
>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>> t1.priority=5
>>>>>  )
>>>>>  AND NOT EXISTS (
>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>  )
>>>>>  AND NOT EXISTS (
>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>  )
>>>>>  ORDER BY t0.docpriority ASC
>>>>>  LIMIT 480;
>>>>>
>>>>> Your analysis of whether IN beats OR does not agree with experiments I
>>>>> did on postgresql 8.7 which showed no difference.  What Postgresql version
>>>>> are you using?  Also, I trust you have query plans that demonstrate your
>>>>> claim?  In any case, whether IN vs. OR is generated is a function of the
>>>>> MCF database driver, so this is trivial to experiment with.  I'll create a
>>>>> ticket and a branch for experimentation.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <[email protected]
>>>>> > wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't
>>>>>> seem to buy us that much on the table scan - in the attached you'll see
>>>>>> that it's still taking a long time to return the job status page. Also in
>>>>>> the attached are some sample other long running queries that we're
>>>>>> beginning to see more frequently. There's also an example of a query 
>>>>>> that's
>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>> postgres.
>>>>>>
>>>>>> We're loading the system from 10mil towards 100mil so would be keen
>>>>>> to work with you to optimise where possible.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> The jobstatus query that uses count(*) should be doing something
>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>
>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>
>>>>>>> This will still do a sequential scan, but it will be an aborted one,
>>>>>>> so you can control the maximum amount of time spent doing the query.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've 
>>>>>>>> missed
>>>>>>>> something.
>>>>>>>>
>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>> together some samples. As an example, on a job that's currently 
>>>>>>>> aborting:
>>>>>>>>
>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>>>>>>>> '1.000000001E9'
>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>>>>>>>> '1407144048075'
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
>>>>>>>> jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>> width=287)
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>> ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 
>>>>>>>> rows=764916
>>>>>>>> width=0)
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual to 
>>>>>>>> whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely those of 
>>>>>>>> the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>> you are not the intended recipient of this email, you must neither 
>>>>>>>> take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>> contact the sender if you believe you have received this email in 
>>>>>>>> error.
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you can
>>>>>>>>> set which limits the number of documents counted to at most a 
>>>>>>>>> specified
>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>>> doing these kinds of queries:
>>>>>>>>>
>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>
>>>>>>>>> The documentation says that the default value for this parameter
>>>>>>>>> is 10000, which however is incorrect.  The actual true default is 
>>>>>>>>> 500000.
>>>>>>>>> You could set that lower for better UI performance (losing some
>>>>>>>>> information, of course.)
>>>>>>>>>
>>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>>> spent in MCF to insure that this doesn't happen.  Specifically, the 
>>>>>>>>> main
>>>>>>>>> document queuing query is structured to read directly out of a 
>>>>>>>>> specific
>>>>>>>>> jobqueue index.  This is the crucial query that must work properly for
>>>>>>>>> scalability, since doing a query that is effectively just a sort on 
>>>>>>>>> the
>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly 
>>>>>>>>> because it
>>>>>>>>> makes a huge distinction between whether there's zero of something or 
>>>>>>>>> one
>>>>>>>>> of something, but you can work around that particular issue by 
>>>>>>>>> setting the
>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>> means that reanalysis of the table has to occur on every stuffing 
>>>>>>>>> query.
>>>>>>>>>
>>>>>>>>> I'd appreciate seeing the queries that are long-running in your
>>>>>>>>> case so that I can see if that is what you are encountering or not.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>>
>>>>>>>>>> We're beginning to see issues with a document count > 10 million.
>>>>>>>>>> At that point, even with good postgres vacuuming the jobqueue
>>>>>>>>>> table is starting to become a bottleneck.
>>>>>>>>>>
>>>>>>>>>> For example select count(*) from jobqueue, which is executed
>>>>>>>>>> when querying job status will do a full table scan of jobqueue
>>>>>>>>>> which has more than 10 million rows. That's going to take some time 
>>>>>>>>>> in
>>>>>>>>>> postgres.
>>>>>>>>>>
>>>>>>>>>> SSDs will certainly make a big difference to document processing
>>>>>>>>>> through-put (which we see is largely I/O bound in postgres) but
>>>>>>>>>> we are increasingly seeing long running queries in the logs. Our 
>>>>>>>>>> current
>>>>>>>>>> thinking is that we'll need to refactor JobQueue somewhat to
>>>>>>>>>> optimise queries and, potentially partition jobqueue into a
>>>>>>>>>> subset of tables (table per queue for example).
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> VP Engineering,
>>>>>>>>>> Exonar Ltd
>>>>>>>>>>
>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>
>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>
>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>> confidential and are intended solely for the use of the individual 
>>>>>>>>>> to whom
>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of 
>>>>>>>>>> the
>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>> you are not the intended recipient of this email, you must neither 
>>>>>>>>>> take any
>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>> contact the sender if you believe you have received this email in 
>>>>>>>>>> error.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>
>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>>>>>>> parallel connectors.  Overall database performance is the limiting 
>>>>>>>>>>> factor.
>>>>>>>>>>>
>>>>>>>>>>> I would read this:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>
>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>>>>>>>>> discusses this issue.
>>>>>>>>>>>
>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and 
>>>>>>>>>>> with modern
>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say 
>>>>>>>>>>> it was
>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that 
>>>>>>>>>>> you use
>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to know what is the maximum number of documents
>>>>>>>>>>>> that you managed to crawl with ManifoldCF and with how many 
>>>>>>>>>>>> connectors in
>>>>>>>>>>>> parallel it could works ?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>
>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Reply via email to