Hi Paul,

The jobstatus query that uses count(*) should be doing something like this
when the maxdocumentstatuscount value is set:

select count(*) from jobqueue where xxx limit 500001

This will still do a sequential scan, but the scan will be cut short once the
limit is reached, so the maximum amount of time spent on the query is bounded.
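
(Note that for Postgres to actually stop the scan early, the limit generally
has to sit inside a subquery that the count wraps; a LIMIT applied directly to
a count(*) only limits the single aggregate row it returns.  A sketch of that
pattern, with xxx standing in for the real status clause:

select count(*) from (select 1 from jobqueue where xxx limit 500001) t

Either way, the intent is that the scan stops at 500001 rows rather than
visiting the whole table.)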

Karl


On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <[email protected]>
wrote:

> Hi,
>
> We've had a play with maxstatuscount and couldn't stop it from
> count(*)-ing, but I'll certainly have another look to see if we've missed
> something.
>
> We're increasingly seeing long-running queries, and I'll put together some
> samples. As an example, here's one from a job that's currently aborting:
>
> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE jobid=?]
>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0: '1.000000001E9'
>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1: '1407144048075'
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916 width=0)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:               Index Cond: (jobid = 1407144048075::bigint)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats: n_distinct=4.0 most_common_vals={G,C,Z,P} most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>
> Paul
>
>
> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Paul,
>>
>> For the jobqueue scans from the UI, there is a parameter you can set
>> which limits the number of documents counted to at most a specified
>> amount.  This uses a limit clause, which should prevent unbounded time
>> doing these kinds of queries:
>>
>> org.apache.manifoldcf.ui.maxstatuscount
>>
>> The documentation says that the default value for this parameter is
>> 10000, but that is incorrect; the actual default is 500000.  You could set
>> it lower for better UI performance (losing some information, of course).
>>
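>> For reference, setting it in properties.xml would look something like this
>> (the value of 100000 is just an illustration, not a recommendation):
>>
>>   <property name="org.apache.manifoldcf.ui.maxstatuscount" value="100000"/>
>>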
>> As for long-running queries, a lot of time and effort has been spent in
>> MCF to ensure that this doesn't happen.  Specifically, the main document
>> queuing query is structured to read directly out of a specific jobqueue
>> index.  This is the crucial query that must work properly for scalability,
>> since a query that is effectively just a sort over the entire jobqueue
>> would be a major problem.  There are times when Postgresql's optimizer
>> fails to do the right thing here, mostly because it makes a huge
>> distinction between there being zero of something versus one of something.
>> If you start to see that problem, you can work around it by setting the
>> analyze count to 1, which basically means that the table is reanalyzed
>> before every stuffing query.
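>>
>> (Reanalysis here just means regenerating Postgres's planner statistics for
>> the jobqueue table, i.e. the equivalent of running:
>>
>>   ANALYZE jobqueue;
>>
>> which MCF itself takes care of, driven by that analyze count.)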
>>
>> I'd appreciate seeing the queries that are long-running in your case so
>> that I can see if that is what you are encountering or not.
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <[email protected]>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> We're beginning to see issues with a document count > 10 million. At
>>> that point, even with good postgres vacuuming, the jobqueue table is
>>> starting to become a bottleneck.
>>>
>>> For example, select count(*) from jobqueue, which is executed when
>>> querying job status, will do a full table scan of jobqueue, which has
>>> more than 10 million rows. That's going to take some time in postgres.
>>>
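>>> A quick check of what the planner does for that count (run by hand, not
>>> something MCF issues itself) would be:
>>>
>>>   explain select count(*) from jobqueue;
>>>
>>> which shows whether Postgres falls back to a Seq Scan over the whole table
>>> or can get away with an index-only scan.
>>>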
>>> SSDs will certainly make a big difference to document processing
>>> throughput (which we see is largely I/O bound in postgres), but we are
>>> increasingly seeing long-running queries in the logs. Our current thinking
>>> is that we'll need to refactor JobQueue somewhat to optimise queries
>>> and potentially partition jobqueue into a set of tables (a table per
>>> queue, for example).
>>>
>>> Paul
>>>
>>>
>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Baptiste,
>>>>
>>>> ManifoldCF is not limited by the number of agents processes or parallel
>>>> connectors.  Overall database performance is the limiting factor.
>>>>
>>>> I would read this:
>>>>
>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>
>>>> Also, there's a section in ManifoldCF in Action (I believe Chapter 2)
>>>> that discusses this issue.
>>>>
>>>> Some five years ago, I successfully crawled 5 million web documents
>>>> using Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSDs, I
>>>> expect that you will do even better.  In general, I'd say it's fine to
>>>> shoot for 10M - 100M documents on ManifoldCF, provided that you use a
>>>> good database and maintain it properly.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I would like to know the maximum number of documents that you have
>>>>> managed to crawl with ManifoldCF, and how many connectors it can work
>>>>> with in parallel.
>>>>>
>>>>> Thanks for your answer
>>>>>
>>>>> Baptiste
>>>>>
>>>>
>>>>
>>>
>>
>
