The changes pass all tests here. Is it possible that you attempted an upgrade that failed (or didn't attempt an upgrade but moved to a new code version)?
If you could let me know as exactly as possible what you did, I can let you know if that should have worked or not. Thanks!
Karl

On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <[email protected]> wrote:

> Karl,
>
> We appear to be seeing an issue with the performance change to use an OR clause rather than IN. After making the change, when we restart manifoldcf (with one job in running state), documents in the running job are not picked up for processing by the stuffer thread. If we redeploy base 1.6.1 and restart, documents are processed. This is consistently switchable depending on which code base is deployed.
>
> We have logs that I could upload to the ticket if you recommend that we reopen the issue (or create a new one)?
>
> Paul
>
> VP Engineering, Exonar Ltd
> T: +44 7940 567724
> twitter: @exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and are intended solely for the use of the individual to whom it is addressed. Any views or opinions expressed are solely those of the author and do not necessarily represent those of Exonar Ltd. If you are not the intended recipient of this email, you must neither take any action based upon its contents, nor copy or show it to anyone. Please contact the sender if you believe you have received this email in error.
>
> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Paul --
>>
>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of trunk, which is MCF 2.0. MCF 2.0 is not backwards compatible with any previous MCF release, and indeed there is no upgrade from any 1.x release to 2.0. That's why I said to use the patches, and try to stay on 1.6.1 or at most to migrate to 1.7.
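[Editor's note: for readers following the IN-vs-OR discussion in this thread, the two predicate forms are logically equivalent; only the query planner's treatment of them can differ. A minimal sketch of the equivalence, using Python's bundled SQLite rather than PostgreSQL (so it demonstrates the semantics, not the Postgres plan difference the thread is about; the table and values are simplified from the stuffing query quoted later):]

```python
import sqlite3

# Tiny jobqueue stand-in: ids 0..5 with single-letter status codes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobqueue (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO jobqueue VALUES (?, ?)",
                 list(enumerate("PGZCAP")))

# The IN form, as generated by MCF 1.6.1 ...
in_form = conn.execute(
    "SELECT id FROM jobqueue WHERE status IN ('P','G') ORDER BY id"
).fetchall()

# ... and the equivalent OR form from the experimental change.
or_form = conn.execute(
    "SELECT id FROM jobqueue WHERE status='P' OR status='G' ORDER BY id"
).fetchall()

print(in_form == or_form)  # both return rows 0, 1, 5
```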
>>
>> IF you ALREADY tried an upgrade with the branch code, then you would have wound up in a schema state where the schema had more columns in it than the branch knew how to deal with. That's bad, and you will need to do things to fix the situation. I believe you should still be able to do the following:
>>
>> - Download 1.7 source, or check out https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>> - Apply the patches
>> - Build
>> - Modify your properties.xml to point to your postgresql instance
>> - Run the upgrade (initialize.bat on the multi-process example, or start the single-process example)
>>
>> You should then have a working 1.7 release, with code patches applied.
>>
>> Thanks,
>> Karl
>>
>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <[email protected]> wrote:
>>
>>> Thanks - we've pulled down the branch and will test the changes. It looks like a branch of 1.7 so it's going to take us a little while to test. We need to migrate our connectors (there's some deprecated stuff that's now been cleared in 1.7, e.g. getShareACL) and we'll need to patch the database to include the pipeline and any other schema changes. We'll have some environment contention over the next week as our performance test environment needs to remain on 1.6.1 while we test a release. Once that's clear I'll move to 1.7.
>>>
>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a simple way to migrate an existing database?
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Thanks -- I'll include that change as well then, in ticket CONNECTORS-1027.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> That comes back immediately with 10001 rows:
>>>>>
>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t;
>>>>>
>>>>>                                  QUERY PLAN
>>>>> --------------------------------------------------------------------------------
>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual time=9.125..9.125 rows=1 loops=1)
>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual time=0.033..6.945 rows=10001 loops=1)
>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue  (cost=0.00..431189.31 rows=10290271 width=0) (actual time=0.031..3.257 rows=10001 loops=1)
>>>>>                Heap Fetches: 725
>>>>>  Total runtime: 9.157 ms
>>>>> (5 rows)
>>>>>
>>>>> Whereas:
>>>>>
>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>
>>>>>                                  QUERY PLAN
>>>>> --------------------------------------------------------------------------------
>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual time=5225.107..5225.109 rows=1 loops=1)
>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue  (cost=0.00..431197.31 rows=10290271 width=0) (actual time=0.108..3090.848 rows=10370209 loops=1)
>>>>>                Heap Fetches: 684297
>>>>>  Total runtime: 5225.151 ms
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Paul
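[Editor's note: the two plans above differ because a LIMIT outside an aggregate only caps the number of *result* rows (count(*) always returns one row), so the whole table is still scanned; a LIMIT inside the FROM subquery caps the rows actually fed to count(*), which is what bounds the work. A portable sketch of the semantics using Python's bundled SQLite (same LIMIT behaviour, though SQLite's planner differs from Postgres):]

```python
import sqlite3

# 20-row stand-in for the 10M-row jobqueue table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobqueue (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO jobqueue (id) VALUES (?)",
                 [(i,) for i in range(20)])

# Outer LIMIT: applies after aggregation, so all 20 rows are counted.
outer = conn.execute(
    "SELECT count(*) FROM jobqueue LIMIT 5").fetchone()[0]

# Subquery LIMIT: at most 5 rows reach count(*).
inner = conn.execute(
    "SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 5) t").fetchone()[0]

print(outer, inner)  # 20 5
```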
>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> Could you try this query on your database please and tell me if it executes promptly:
>>>>>>
>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>
>>>>>> I vaguely remember that I had to change the form of this query in order to support MySQL -- but first let's see if this helps.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch (branches/CONNECTORS-1027) for looking at any changes we do for large-scale Postgresql optimization work.
>>>>>>>
>>>>>>> Please note that trunk code already has schema changes relative to MCF 1.7, so you will not be able to work directly with this branch code. I'll have to create patches for whatever changes you would need to try.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the doc.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> Can you include the logged plan for this query; this is an actual query encountered during crawling:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>
>>>>>>>>> These queries are all from the UI; this is what gets generated when no limits are in place:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>> WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>
>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM jobqueue t1 LIMIT 1000001]
>>>>>>>>>
>>>>>>>>> I honestly don't understand why PostgreSQL would execute a sequential scan of the entire table when given a limit clause. It certainly didn't use to do that. If you have any other suggestions please let me know.
>>>>>>>>>
>>>>>>>>> Some queries show up in this list because MCF periodically reindexes tables. For example, this query goes only against the (small) jobs table. Its poor performance on occasion is likely due to something else happening to the database, probably a reindex:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>
>>>>>>>>> The final query is the document stuffing query, which is perhaps the most critical query in the whole system:
>>>>>>>>>
>>>>>>>>> SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0
>>>>>>>>> WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime <= 1407246846166
>>>>>>>>> AND EXISTS (
>>>>>>>>>   SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5
>>>>>>>>> )
>>>>>>>>> AND NOT EXISTS (
>>>>>>>>>   SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>> )
>>>>>>>>> AND NOT EXISTS (
>>>>>>>>>   SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>> )
>>>>>>>>> ORDER BY t0.docpriority ASC
>>>>>>>>> LIMIT 480;
>>>>>>>>>
>>>>>>>>> Your analysis of whether IN beats OR does not agree with experiments I did on postgresql 8.7, which showed no difference. What Postgresql version are you using? Also, I trust you have query plans that demonstrate your claim? In any case, whether IN vs. OR is generated is a function of the MCF database driver, so this is trivial to experiment with. I'll create a ticket and a branch for experimentation.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>>
>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't seem to buy us that much on the table scan - in the attached you'll see that it's still taking a long time to return the job status page. Also in the attached are some other sample long-running queries that we're beginning to see more frequently. There's also an example of a query that's frequently executed and regularly takes > 4 secs (plus a suggested change to improve performance). This one in particular would certainly benefit from a change to SSDs, which should relieve the I/O-bound bottleneck on postgres.
>>>>>>>>>>
>>>>>>>>>> We're loading the system from 10mil towards 100mil so would be keen to work with you to optimise where possible.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>> The jobstatus query that uses count(*) should be doing something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>
>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>
>>>>>>>>>>> This will still do a sequential scan, but it will be an aborted one, so you can control the maximum amount of time spent doing the query.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from count(*)-ing but I'll certainly have another look to see if we've missed something.
>>>>>>>>>>>>
>>>>>>>>>>>> We're increasingly seeing long running threads and I'll put together some samples.
>>>>>>>>>>>> As an example, on a job that's currently aborting:
>>>>>>>>>>>>
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0: '1.000000001E9'
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1: '1407144048075'
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916 width=0)
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats: n_distinct=4.0 most_common_vals={G,C,Z,P} most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you can set which limits the number of documents counted to at most a specified amount. This uses a limit clause, which should prevent unbounded time doing these kinds of queries:
>>>>>>>>>>>>>
>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>
>>>>>>>>>>>>> The documentation says that the default value for this parameter is 10000, which however is incorrect. The actual default is 500000. You could set that lower for better UI performance (losing some information, of course.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has been spent in MCF to ensure that this doesn't happen. Specifically, the main document queuing query is structured to read directly out of a specific jobqueue index. This is the crucial query that must work properly for scalability, since doing a query that is effectively just a sort on the entire jobqueue would be a major problem. There are some times where Postgresql's optimizer fails to do the right thing here, mostly because it makes a huge distinction between whether there's zero of something or one of something, but you can work around that particular issue by setting the analyze count to 1 if you start to see this problem -- which basically means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10 million. At that point, even with good postgres vacuuming, the jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For example, select count(*) from jobqueue, which is executed when querying job status, will do a full table scan of jobqueue, which has more than 10 million rows. That's going to take some time in postgres.
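[Editor's note: the UI count cap Karl describes above is set in ManifoldCF's properties.xml. A sketch of the relevant entry; the property name is taken verbatim from Karl's email, and the surrounding configuration element and default values are as discussed in the thread:]

```xml
<configuration>
  <!-- Cap the number of documents counted for the job status page.
       Per the thread: the documented default is 10000, but the actual
       default is 500000; lower values trade detail for UI speed. -->
  <property name="org.apache.manifoldcf.ui.maxstatuscount" value="10000"/>
</configuration>
```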
>>>>>>>>>>>>>> SSDs will certainly make a big difference to document processing throughput (which we see is largely I/O bound in postgres) but we are increasingly seeing long running queries in the logs. Our current thinking is that we'll need to refactor JobQueue somewhat to optimise queries and potentially partition jobqueue into a subset of tables (table per queue, for example).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agent processes or parallel connectors. Overall database performance is the limiting factor.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that discusses this issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web documents, using Postgresql 8.3. Postgresql 9.x is faster, and with modern SSDs, I expect that you will do even better. In general, I'd say it was fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to know the maximum number of documents that you have managed to crawl with ManifoldCF, and with how many connectors in parallel it could work?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Baptiste
