Hi Paul,

Could you try this query on your database, please, and tell me if it executes promptly:

SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t

I vaguely remember that I had to change the form of this query in order to support MySQL -- but first let's see if this helps.

Karl
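A quick sketch of why the subquery form matters, assuming a psql session against the MCF database: in the plain count form used by the UI (quoted further down the thread), the LIMIT applies to the single row the aggregate returns, so the whole table is still scanned; in the subquery form the LIMIT bounds the rows fed into the count, so the scan can stop after 10001 rows.

-- Plain EXPLAIN shows where the Limit node sits in each plan; add ANALYZE for
-- real timings (which actually runs the statements).
EXPLAIN SELECT COUNT(dochash) AS doccount FROM jobqueue LIMIT 1000001;
EXPLAIN SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t;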
On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <[email protected]> wrote:

> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
> (branches/CONNECTORS-1027) for looking at any changes we make for
> large-scale Postgresql optimization work.
>
> Please note that trunk code already has schema changes relative to MCF
> 1.7, so you will not be able to work directly with this branch code. I'll
> have to create patches for whatever changes you would need to try.
>
> Thanks,
> Karl
>
> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <[email protected]>
> wrote:
>
>> Hi,
>>
>> We're on Postgres 9.2. I'll get the query plans and add them to the doc.
>>
>> Thanks
>>
>> Paul
>>
>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Paul,
>>>
>>> Can you include the logged plan for this query? This is an actual query
>>> encountered during crawling:
>>>
>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a long-running
>>> query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1,
>>> jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>
>>> These queries are all from the UI; this is what gets generated when no
>>> limits are in place:
>>>
>>> WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>> FROM jobqueue t1 GROUP BY jobid]
>>> WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>> FROM jobqueue t1 GROUP BY jobid]
>>>
>>> This query is from the UI with a limit of 1000000:
>>>
>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>> jobqueue t1 LIMIT 1000001]
>>>
>>> I honestly don't understand why PostgreSQL would execute a sequential
>>> scan of the entire table when given a LIMIT clause. It certainly never
>>> used to do that. If you have any other suggestions, please let me know.
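A rough sketch of how the plan for the carrydown join above could be captured directly in psql on PostgreSQL 9.2; the literal values below are purely hypothetical stand-ins for the ? bind parameters, and ANALYZE actually executes the query:

EXPLAIN (ANALYZE, BUFFERS)
SELECT t0.id, t0.dochash, t0.docid
FROM carrydown t1, jobqueue t0
WHERE t1.jobid = 1407144048075      -- hypothetical job id
  AND t1.parentidhash = 'ABC123'    -- hypothetical parent hash
  AND t0.dochash = t1.childidhash
  AND t0.jobid = t1.jobid
  AND t1.isnew = 'B';               -- hypothetical isnew flag value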
>>> Some queries show up in this list because MCF periodically reindexes
>>> tables. For example, this query goes only against the (small) jobs table;
>>> its poor performance on occasion is likely due to something else happening
>>> to the database, probably a reindex:
>>>
>>> WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running
>>> query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR
>>> UPDATE]
>>>
>>> The final query is the document stuffing query, which is perhaps the
>>> most critical query in the whole system:
>>>
>>> SELECT
>>>   t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>   t0.priorityset
>>> FROM jobqueue t0
>>> WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime <= 1407246846166
>>> AND EXISTS (
>>>   SELECT 'x' FROM jobs t1
>>>   WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5
>>> )
>>> AND NOT EXISTS (
>>>   SELECT 'x' FROM jobqueue t2
>>>   WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d')
>>>   AND t2.jobid!=t0.jobid
>>> )
>>> AND NOT EXISTS (
>>>   SELECT 'x' FROM prereqevents t3,events t4
>>>   WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>> )
>>> ORDER BY t0.docpriority ASC
>>> LIMIT 480;
>>>
>>> Your analysis of whether IN beats OR does not agree with experiments I
>>> did on PostgreSQL 8.x, which showed no difference. What PostgreSQL version
>>> are you using? Also, I trust you have query plans that demonstrate your
>>> claim? In any case, whether IN or OR is generated is a function of the
>>> MCF database driver, so this is trivial to experiment with. I'll create a
>>> ticket and a branch for experimentation.
>>>
>>> Thanks,
>>> Karl
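One way the IN-versus-OR question could be checked quickly, as a sketch only (the status values are taken from the stuffing query above; add ANALYZE for actual timings):

EXPLAIN SELECT id FROM jobqueue
WHERE status IN ('P','G') AND checkaction='R' LIMIT 480;

EXPLAIN SELECT id FROM jobqueue
WHERE (status='P' OR status='G') AND checkaction='R' LIMIT 480;

If both plans pick the same index with comparable costs, that would match the earlier observation of no difference.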
>>>
>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <[email protected]>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> Changing maxstatuscount to something much smaller (10,000) doesn't seem
>>>> to buy us much on the table scan - in the attached you'll see that it's
>>>> still taking a long time to return the job status page. Also attached are
>>>> some other sample long-running queries that we're beginning to see more
>>>> frequently. There's also an example of a query that's frequently executed
>>>> and regularly takes > 4 secs (plus a suggested change to improve
>>>> performance). This one in particular would certainly benefit from a move
>>>> to SSDs, which should relieve the I/O-bound bottleneck on Postgres.
>>>>
>>>> We're loading the system from 10 million documents towards 100 million,
>>>> so we'd be keen to work with you to optimise where possible.
>>>>
>>>> Thanks,
>>>>
>>>> Paul
>>>>
>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> The jobstatus query that uses count(*) should be doing something like
>>>>> this when the maxstatuscount value is set:
>>>>>
>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>
>>>>> This will still do a sequential scan, but it will be an aborted one,
>>>>> so you can control the maximum amount of time spent doing the query.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>> count(*)-ing, but I'll certainly have another look to see if we've
>>>>>> missed something.
>>>>>>
>>>>>> We're increasingly seeing long-running threads and I'll put together
>>>>>> some samples. As an example, on a job that's currently aborting:
>>>>>>
>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running
>>>>>> query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL
>>>>>> WHERE jobid=?]
>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Parameter 0: '1.000000001E9'
>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Parameter 1: '1407144048075'
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: Update on
>>>>>> jobqueue (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan:   ->  Bitmap Heap
>>>>>> Scan on jobqueue (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan:     Recheck Cond:
>>>>>> (jobid = 1407144048075::bigint)
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan:     ->  Bitmap
>>>>>> Index Scan on i1392985450177 (cost=0.00..18614.85 rows=764916 width=0)
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan:       Index Cond:
>>>>>> (jobid = 1407144048075::bigint)
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) - Stats: n_distinct=4.0
>>>>>> most_common_vals={G,C,Z,P}
>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>
>>>>>> Paul
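One experiment that could be tried against a copy of the data, as a sketch only and not how MCF issues this update today: break the priority reset into batches keyed off the jobqueue primary key, so that no single statement has to rewrite all ~765k rows of the job in one transaction. The literal values are the logged parameters above:

UPDATE jobqueue
SET docpriority = 1.000000001E9, priorityset = NULL
WHERE id IN (
  SELECT id FROM jobqueue
  WHERE jobid = 1407144048075
    AND (docpriority IS DISTINCT FROM 1.000000001E9 OR priorityset IS NOT NULL)
  LIMIT 10000
);
-- repeat until the statement reports 0 rows updated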
>>>>>>
>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> For the jobqueue scans from the UI, there is a parameter you can set
>>>>>>> which limits the number of documents counted to at most a specified
>>>>>>> amount. This uses a LIMIT clause, which should prevent unbounded time
>>>>>>> being spent on these kinds of queries:
>>>>>>>
>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>
>>>>>>> The documentation says that the default value for this parameter is
>>>>>>> 10000, but that is incorrect; the actual default is 500000. You could
>>>>>>> set it lower for better UI performance (losing some information, of
>>>>>>> course).
>>>>>>>
>>>>>>> As for long-running queries, a lot of time and effort has been spent
>>>>>>> in MCF to ensure that this doesn't happen. Specifically, the main
>>>>>>> document queuing query is structured to read directly out of a specific
>>>>>>> jobqueue index. This is the crucial query that must work properly for
>>>>>>> scalability, since a query that is effectively just a sort over the
>>>>>>> entire jobqueue would be a major problem. There are times when
>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because
>>>>>>> it makes a huge distinction between whether there are zero of something
>>>>>>> or one of something, but you can work around that particular issue by
>>>>>>> setting the analyze count to 1 if you start to see this problem -- which
>>>>>>> basically means that reanalysis of the table has to occur on every
>>>>>>> stuffing query.
>>>>>>>
>>>>>>> I'd appreciate seeing the queries that are long-running in your case,
>>>>>>> so that I can see whether that is what you are encountering.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> We're beginning to see issues with a document count > 10 million.
>>>>>>>> At that point, even with good Postgres vacuuming, the jobqueue table
>>>>>>>> is starting to become a bottleneck.
>>>>>>>>
>>>>>>>> For example, select count(*) from jobqueue, which is executed when
>>>>>>>> querying job status, will do a full table scan of jobqueue, which has
>>>>>>>> more than 10 million rows. That's going to take some time in Postgres.
>>>>>>>>
>>>>>>>> SSDs will certainly make a big difference to document processing
>>>>>>>> throughput (which we see is largely I/O-bound in Postgres), but we are
>>>>>>>> increasingly seeing long-running queries in the logs. Our current
>>>>>>>> thinking is that we'll need to refactor JobQueue somewhat to optimise
>>>>>>>> queries and potentially partition jobqueue into a set of tables (a
>>>>>>>> table per queue, for example).
>>>>>>>>
>>>>>>>> Paul
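Where an exact figure isn't needed for a status display, one rough alternative to a full count(*) scan, as a sketch only and only as fresh as the table's last VACUUM/ANALYZE, is the planner's own row estimate:

SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'jobqueue';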
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Baptiste,
>>>>>>>>>
>>>>>>>>> ManifoldCF is not limited by the number of agent processes or
>>>>>>>>> parallel connectors. Overall database performance is the limiting
>>>>>>>>> factor.
>>>>>>>>>
>>>>>>>>> I would read this:
>>>>>>>>>
>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>
>>>>>>>>> Also, there's a section in ManifoldCF in Action (I believe Chapter 2)
>>>>>>>>> that discusses this issue.
>>>>>>>>>
>>>>>>>>> Some five years ago, I successfully crawled 5 million web documents
>>>>>>>>> using Postgresql 8.3. Postgresql 9.x is faster, and with modern SSDs
>>>>>>>>> I expect that you will do even better. In general, I'd say it is fine
>>>>>>>>> to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>> a good database and maintain it properly.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I would like to know the maximum number of documents that you have
>>>>>>>>>> managed to crawl with ManifoldCF, and how many connectors it can run
>>>>>>>>>> in parallel.
>>>>>>>>>>
>>>>>>>>>> Thanks for your answer,
>>>>>>>>>>
>>>>>>>>>> Baptiste
