Hi Karl,

Changing maxstatuscount to something much smaller (10,000) doesn't seem to
buy us much on the table scan - in the attached you'll see that it's still
taking a long time to return the job status page. Also attached are some
other sample long-running queries that we're beginning to see more
frequently. There's also an example of a frequently executed query that
regularly takes > 4 secs (plus a suggested change to improve its
performance). That one in particular would certainly benefit from a move to
SSDs, which should relieve the I/O bottleneck on postgres.
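A possible reason the smaller cap doesn't help: in the logged queries the LIMIT sits outside the COUNT aggregate, so it only limits the single-row result, not the scan. A minimal sketch of the semantics, using an in-memory SQLite table to stand in for Postgres (table contents made up; Postgres treats LIMIT placement the same way):

```python
import sqlite3

# Miniature stand-in for the multi-million-row jobqueue table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobqueue (dochash TEXT, status TEXT)")
conn.executemany("INSERT INTO jobqueue VALUES (?, ?)",
                 [("doc%d" % i, "G") for i in range(50000)])

# LIMIT applies to the *result rows* of the aggregate -- one row -- so every
# row is still counted (and, in Postgres, still sequentially scanned).
(full,) = conn.execute(
    "SELECT COUNT(dochash) AS doccount FROM jobqueue LIMIT 10001").fetchone()
print(full)   # 50000, not 10001

# Capping the count requires the LIMIT inside a subquery, so row production
# can stop early once the cap is reached.
(capped,) = conn.execute(
    "SELECT COUNT(*) FROM (SELECT dochash FROM jobqueue LIMIT 10001)").fetchone()
print(capped)  # 10001
```

If the UI's counting query is of the first shape, lowering the cap would change the reported number but not the scan cost, which would match what we're seeing.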

We're loading the system from 10 million towards 100 million documents, so
we'd be keen to work with you to optimise where possible.

Thanks,

Paul




VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <[email protected]> wrote:

> Hi Paul,
>
> The jobstatus query that uses count(*) should be doing something like this
> when the maxstatuscount value is set:
>
> select count(*) from jobqueue where xxx limit 500001
>
> This will still do a sequential scan, but it will be an aborted one, so
> you can control the maximum amount of time spent doing the query.
>
> Karl
>
>
> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <[email protected]>
> wrote:
>
>> Hi,
>>
>> We've had a play with maxstatuscount and couldn't stop it from
>> count(*)-ing but I'll certainly have another look to see if we've missed
>> something.
>>
>> We're increasingly seeing long running threads and I'll put together some
>> samples. As an example, on a job that's currently aborting:
>>
>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE jobid=?]
>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0: '1.000000001E9'
>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1: '1407144048075'
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916 width=0)
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats: n_distinct=4.0 most_common_vals={G,C,Z,P} most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>
>> Paul
>>
>>
>>
>>
>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Paul,
>>>
>>> For the jobqueue scans from the UI, there is a parameter you can set
>>> which limits the number of documents counted to at most a specified
>>> amount.  This uses a limit clause, which should prevent unbounded time
>>> doing these kinds of queries:
>>>
>>> org.apache.manifoldcf.ui.maxstatuscount
>>>
>>> The documentation says that the default value for this parameter is
>>> 10000, but that is incorrect; the actual default is 500000.  You could
>>> set it lower for better UI performance (losing some information, of
>>> course.)
>>>
>>> As for long-running queries, a lot of time and effort has been spent in
>>> MCF to ensure that this doesn't happen.  Specifically, the main document
>>> queuing query is structured to read directly out of a specific jobqueue
>>> index.  This is the crucial query that must work properly for scalability,
>>> since a query that is effectively just a sort on the entire jobqueue
>>> would be a major problem.  There are times when Postgresql's optimizer
>>> fails to do the right thing here, mostly because it makes a huge
>>> distinction between whether there's zero of something or one of something.
>>> You can work around that particular issue by setting the analyze count to
>>> 1 if you start to see this problem, which basically means that reanalysis
>>> of the table occurs on every stuffing query.
>>>
>>> I'd appreciate seeing the queries that are long-running in your case so
>>> that I can see if that is what you are encountering or not.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>>
>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <[email protected]>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> We're beginning to see issues with a document count > 10 million. At
>>>> that point, even with good postgres vacuuming the jobqueue table is
>>>> starting to become a bottleneck.
>>>>
>>>> For example, select count(*) from jobqueue, which is executed when
>>>> querying job status, will do a full table scan of jobqueue, which has
>>>> more than 10 million rows. That's going to take some time in postgres.
>>>>
>>>> SSDs will certainly make a big difference to document processing
>>>> throughput (which we see is largely I/O bound in postgres), but we are
>>>> increasingly seeing long-running queries in the logs. Our current thinking
>>>> is that we'll need to refactor JobQueue somewhat to optimise queries
>>>> and, potentially, partition jobqueue into a set of tables (a table per
>>>> queue, for example).
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Baptiste,
>>>>>
>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>
>>>>> I would read this:
>>>>>
>>>>>
>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>
>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>>> discusses this issue.
>>>>>
>>>>> Some five years ago, I successfully crawled 5 million web documents,
>>>>> using Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSDs, I
>>>>> expect that you will do even better.  In general, I'd say it was fine to
>>>>> shoot for 10M - 100M documents on ManifoldCF, provided that you use a good
>>>>> database, and provided that you maintain it properly.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I would like to know the maximum number of documents that you have
>>>>>> managed to crawl with ManifoldCF, and how many connectors it can run
>>>>>> in parallel?
>>>>>>
>>>>>> Thanks for your answer
>>>>>>
>>>>>> Baptiste
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
On startup and job status request:

WARN 2014-09-10 19:33:51,027 (http-apr-8081-exec-7) - Found a long-running query (80303 ms): [SELECT COUNT(dochash) AS doccount FROM jobqueue t1 LIMIT 10001]
 WARN 2014-09-10 19:33:51,130 (http-apr-8081-exec-6) - Found a long-running query (433290 ms): [SELECT COUNT(dochash) AS doccount FROM jobqueue t1 LIMIT 10001]
 WARN 2014-09-10 19:33:58,507 (http-apr-8081-exec-6) -  Plan: Limit  (cost=539312.38..539312.39 rows=1 width=41) (actual time=7377.778..7377.780 rows=1 loops=1)
 WARN 2014-09-10 19:33:58,507 (http-apr-8081-exec-6) -  Plan:   ->  Aggregate  (cost=539312.38..539312.39 rows=1 width=41) (actual time=7377.775..7377.776 rows=1 loops=1)
 WARN 2014-09-10 19:33:58,507 (http-apr-8081-exec-6) -  Plan:         ->  Seq Scan on jobqueue t1  (cost=0.00..513586.70 rows=10290270 width=41) (actual time=0.004..3237.619 rows=10290270 loops=1)
 WARN 2014-09-10 19:33:58,507 (http-apr-8081-exec-6) -  Plan: Total runtime: 7377.829 ms
 WARN 2014-09-10 19:33:58,507 (http-apr-8081-exec-6) -
 WARN 2014-09-10 19:33:58,522 (http-apr-8081-exec-6) -  Stats: n_distinct=4.0 most_common_vals={G,C,Z,P} most_common_freqs={0.41029999,0.36623332,0.16140001,0.062066667}
 WARN 2014-09-10 19:33:58,522 (http-apr-8081-exec-6) -
 WARN 2014-09-10 19:33:59,119 (http-apr-8081-exec-7) -  Plan: Limit  (cost=539312.38..539312.39 rows=1 width=41) (actual time=8092.511..8092.513 rows=1 loops=1)
 WARN 2014-09-10 19:33:59,119 (http-apr-8081-exec-7) -  Plan:   ->  Aggregate  (cost=539312.38..539312.39 rows=1 width=41) (actual time=8092.508..8092.509 rows=1 loops=1)
 WARN 2014-09-10 19:33:59,119 (http-apr-8081-exec-7) -  Plan:         ->  Seq Scan on jobqueue t1  (cost=0.00..513586.70 rows=10290270 width=41) (actual time=0.004..3521.717 rows=10290270 loops=1)
 WARN 2014-09-10 19:33:59,120 (http-apr-8081-exec-7) -  Plan: Total runtime: 8092.561 ms
 WARN 2014-09-10 19:33:59,120 (http-apr-8081-exec-7) -
 WARN 2014-09-10 19:33:59,128 (http-apr-8081-exec-7) -  Stats: n_distinct=4.0 most_common_vals={G,C,Z,P} most_common_freqs={0.41029999,0.36623332,0.16140001,0.062066667}
 WARN 2014-09-10 19:33:59,128 (http-apr-8081-exec-7) -
 
 
 Other queries:
 
 WARN 2014-09-05 12:33:37,436 (Worker thread '65') - Found a long-running query (529827 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
 WARN 2014-09-05 12:33:37,887 (Finisher thread) - Found a long-running query (589231 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR UPDATE]
 WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM jobqueue t1 LIMIT 1000001]
 WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount FROM jobqueue t1 GROUP BY jobid]
 WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount FROM jobqueue t1 GROUP BY jobid]
 WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
 WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR UPDATE]
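These warnings all share one shape, so for what it's worth, a small helper (hypothetical, not part of ManifoldCF) can pull out the elapsed time and the SQL to rank the worst offenders across a log:

```python
import re

# Matches the MCF warning format seen above:
#   "... Found a long-running query (NNNN ms): [SQL]"
LINE_RE = re.compile(r"Found a long-running query \((\d+) ms\): \[(.*?)\]",
                     re.DOTALL)

def parse_warnings(text):
    """Return (elapsed_ms, sql) pairs from a log excerpt, slowest first."""
    hits = [(int(ms), " ".join(sql.split())) for ms, sql in LINE_RE.findall(text)]
    return sorted(hits, reverse=True)

sample = (
    "WARN 2014-09-05 12:33:37,887 (Finisher thread) - Found a long-running "
    "query (589231 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) "
    "FOR UPDATE]"
)
print(parse_warnings(sample))
# [(589231, 'SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR UPDATE')]
```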
 
 
 Other 'Heavy' Queries:
 
 This query runs often, taking 3 secs on average to execute:
 
 SELECT
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
 t0.priorityset FROM jobqueue t0
 WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND t0.checktime
 <= 1407246846166
 AND EXISTS (
   SELECT 'x' FROM jobs t1
   WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND t1.priority=5
 )
 AND NOT EXISTS (
   SELECT 'x' FROM jobqueue t2
   WHERE t2.dochash=t0.dochash AND t2.status IN
 ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
 )
 AND NOT EXISTS (
   SELECT 'x' FROM prereqevents t3,events t4
   WHERE t0.id=t3.owner AND t3.eventname=t4.name
 )
 ORDER BY t0.docpriority ASC
 LIMIT 480;
 
 
 We believe that changing the IN clauses to the equivalent ORs makes the
query execute roughly 14 times faster: 314ms rather than the 4411ms measured
for the query above.
 
 SELECT
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
 t0.priorityset FROM jobqueue t0
 WHERE (t0.status = 'P' OR t0.status = 'G')  AND t0.checkaction='R' AND
 t0.checktime <= 1407246846166
AND EXISTS (
   SELECT 'x' FROM jobs t1
   WHERE (t1.status = 'A' OR t1.status = 'a')  AND t1.id=t0.jobid  AND
 t1.priority=5
 )
 AND NOT EXISTS (
   SELECT 'x' FROM jobqueue t2
   WHERE t2.dochash=t0.dochash AND (t2.status = 'A' OR t2.status = 'F'
 OR t2.status = 'a' OR t2.status = 'f' OR t2.status = 'D' OR t2.status
 = 'd')
   AND t2.jobid!=t0.jobid
 )
 AND NOT EXISTS (
   SELECT 'x' FROM prereqevents t3,events t4
   WHERE t0.id=t3.owner AND t3.eventname=t4.name
 )
 ORDER BY t0.docpriority ASC
 LIMIT 480;
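 
 To be clear, the two forms return identical rows; the difference is purely in how the Postgres planner costs IN versus chained ORs. A quick equivalence check on a toy jobqueue (SQLite standing in for Postgres, data made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobqueue (id INTEGER, status TEXT)")
# Toy rows covering the status values seen in the stats above (G, C, Z, P...).
conn.executemany("INSERT INTO jobqueue VALUES (?, ?)", enumerate("PGACZPFG"))

in_rows = conn.execute(
    "SELECT id FROM jobqueue WHERE status IN ('P','G') ORDER BY id").fetchall()
or_rows = conn.execute(
    "SELECT id FROM jobqueue WHERE status='P' OR status='G' ORDER BY id").fetchall()

# Same result set either way; only the plan (and hence the runtime) differs.
print(in_rows == or_rows)  # True
```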
 
