Re: Apache ManifoldCF Performance

Karl Wright Thu, 11 Sep 2014 08:49:27 -0700

Hi Paul,

The patches I included should work (at worst with some code offsets)
against 1.6.1 code without problems.  I suggest you try that first before
going through an upgrade to 1.7.


Thanks,
Karl

On Thu, Sep 11, 2014 at 11:47 AM, Karl Wright <[email protected]> wrote:

> Hi Paul,
>
> Upgrade is included automatically in MCF.
>
> If you are using the multiprocess example, just run the initialize script
> and it will do the upgrade.  The single process example does any required
> upgrades every time you start it.
>
> Thanks,
> Karl
>
>
>
> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <[email protected]>
> wrote:
>
>> Thanks - we've pulled down the branch and will test the changes. It looks
>> like a branch of 1.7 so it's going to take us a little while to test. We
>> need to migrate our connectors (there's some deprecated stuff that's now
>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
>> to include the pipeline and any other schema changes. We'll have some
>> environment contention over the next week as our performance test
>> environment needs to remain on 1.6.1 while we test a release. Once that's
>> clear I'll move to 1.7
>>
>> On the database schema patch moving from 1.6.1 to 1.7 - is there a simple
>> way to migrate and existing database?
>>
>> Thanks,
>>
>> Paul
>>
>>
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <[email protected]> wrote:
>>
>>> Thanks -- I'll include that change as well then, in ticket
>>> CONNECTORS-1027.
>>>
>>>
>>> Karl
>>>
>>>
>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> That comes back immediately with 10001 rows:
>>>>
>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
>>>> 10001) t;
>>>>
>>>> QUERY PLAN
>>>>
>>>>
>>>> -----------------------------------------------------------------------------------------------------------------------
>>>> ----------------------------------
>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>> time=9.125..9.125 rows=1 loops=1)
>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>> =0.031..3.257 rows=10001 loops=1)
>>>>                Heap Fetches: 725
>>>>  Total runtime: 9.157 ms
>>>> (5 rows)
>>>>
>>>>
>>>> Whereas:
>>>>
>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>
>>>> QUERY PLAN
>>>>
>>>>
>>>> -----------------------------------------------------------------------------------------------------------------------
>>>> ----------------------------------------
>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>> time=5225.105..5225.106 rows=1 loops=1)
>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>                Heap Fetches: 684297
>>>>  Total runtime: 5225.151 ms
>>>>
>>>> Thanks,
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Could you try this query on your database please and tell me if it
>>>>> executes promptly:
>>>>>
>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>
>>>>>
>>>>> I vaguely remember that I had to change the form of this query in
>>>>> order to support MySQL -- but first let's see if this helps.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for 
>>>>>> large-scale
>>>>>> Postgresql optimization work.
>>>>>>
>>>>>> Please note that trunk code already has schema changes relative to
>>>>>> MCF 1.7, so you will not be able to work directly with this branch code.
>>>>>> I'll have to create patches for whatever changes you would need to try.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the
>>>>>>> doc.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to 
>>>>>>> whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> Can you include the logged plan for this query; this is an actual
>>>>>>>> query encountered during crawling:
>>>>>>>>
>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? 
>>>>>>>> AND
>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>
>>>>>>>>
>>>>>>>> These queries are all from the UI; it is what gets generated when
>>>>>>>> no limits are in place:
>>>>>>>>
>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS 
>>>>>>>> doccount
>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS 
>>>>>>>> doccount
>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>
>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>
>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>
>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>> certainly didn't used to do that.  If you have any other suggestions 
>>>>>>>> please
>>>>>>>> let me know.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>> reindexes tables.  For example, this query goes only against the 
>>>>>>>> (small)
>>>>>>>> jobs table.  Its poor performance on occasion is likely due to 
>>>>>>>> something
>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>
>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>
>>>>>>>>
>>>>>>>> The final query is the document stuffing query, which is perhaps
>>>>>>>> the most critical query in the whole system:
>>>>>>>>
>>>>>>>>  SELECT
>>>>>>>>  t0.id
>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>> t0.checktime
>>>>>>>>  <= 1407246846166
>>>>>>>>  AND EXISTS (
>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>> t1.priority=5
>>>>>>>>  )
>>>>>>>>  AND NOT EXISTS (
>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>  )
>>>>>>>>  AND NOT EXISTS (
>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>  )
>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>  LIMIT 480;
>>>>>>>>
>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans 
>>>>>>>> that
>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated 
>>>>>>>> is a
>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't
>>>>>>>>> seem to buy us that much on the table scan - in the attached you'll 
>>>>>>>>> see
>>>>>>>>> that it's still taking a long time to return the job status page. 
>>>>>>>>> Also in
>>>>>>>>> the attached are some sample other long running queries that we're
>>>>>>>>> beginning to see more frequently. There's also an example of a query 
>>>>>>>>> that's
>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested 
>>>>>>>>> change
>>>>>>>>> to improve performance). This one in particular would certainly 
>>>>>>>>> benefit
>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>> postgres.
>>>>>>>>>
>>>>>>>>> We're loading the system from 10mil towards 100mil so would be
>>>>>>>>> keen to work with you to optimise where possible.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to 
>>>>>>>>> whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of 
>>>>>>>>> the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither 
>>>>>>>>> take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in 
>>>>>>>>> error.
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> The jobstatus query that uses count(*) should be doing something
>>>>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>
>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>
>>>>>>>>>> This will still do a sequential scan, but it will be an aborted
>>>>>>>>>> one, so you can control the maximum amount of time spent doing the 
>>>>>>>>>> query.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've 
>>>>>>>>>>> missed
>>>>>>>>>>> something.
>>>>>>>>>>>
>>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>>> together some samples. As an example, on a job that's currently 
>>>>>>>>>>> aborting:
>>>>>>>>>>>
>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter
>>>>>>>>>>> 0: '1.000000001E9'
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter
>>>>>>>>>>> 1: '1407144048075'
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update
>>>>>>>>>>> on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>>>> width=287)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  
>>>>>>>>>>> (cost=0.00..18614.85
>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> VP Engineering,
>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>
>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>
>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>
>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>> confidential and are intended solely for the use of the individual 
>>>>>>>>>>> to whom
>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those 
>>>>>>>>>>> of the
>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>> you are not the intended recipient of this email, you must neither 
>>>>>>>>>>> take any
>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. 
>>>>>>>>>>> Please
>>>>>>>>>>> contact the sender if you believe you have received this email in 
>>>>>>>>>>> error.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you
>>>>>>>>>>>> can set which limits the number of documents counted to at most a 
>>>>>>>>>>>> specified
>>>>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded 
>>>>>>>>>>>> time
>>>>>>>>>>>> doing these kinds of queries:
>>>>>>>>>>>>
>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>
>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true 
>>>>>>>>>>>> default is
>>>>>>>>>>>> 500000.  You could set that lower for better UI performance 
>>>>>>>>>>>> (losing some
>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>
>>>>>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>>>>>> spent in MCF to insure that this doesn't happen.  Specifically, 
>>>>>>>>>>>> the main
>>>>>>>>>>>> document queuing query is structured to read directly out of a 
>>>>>>>>>>>> specific
>>>>>>>>>>>> jobqueue index.  This is the crucial query that must work properly 
>>>>>>>>>>>> for
>>>>>>>>>>>> scalability, since doing a query that is effectively just a sort 
>>>>>>>>>>>> on the
>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times 
>>>>>>>>>>>> where
>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly 
>>>>>>>>>>>> because it
>>>>>>>>>>>> makes a huge distinction between whether there's zero of something 
>>>>>>>>>>>> or one
>>>>>>>>>>>> of something, but you can work around that particular issue by 
>>>>>>>>>>>> setting the
>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which 
>>>>>>>>>>>> basically
>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing 
>>>>>>>>>>>> query.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in your
>>>>>>>>>>>> case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>> million. At that point, even with good postgres vacuuming the
>>>>>>>>>>>>> jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example select count(*) from jobqueue, which is executed
>>>>>>>>>>>>> when querying job status will do a full table scan of jobqueue
>>>>>>>>>>>>> which has more than 10 million rows. That's going to take some 
>>>>>>>>>>>>> time in
>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>
>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>> postgres) but we are increasingly seeing long running queries
>>>>>>>>>>>>> in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per queue
>>>>>>>>>>>>> for example).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>
>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>
>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>> confidential and are intended solely for the use of the 
>>>>>>>>>>>>> individual to whom
>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those 
>>>>>>>>>>>>> of the
>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>> you are not the intended recipient of this email, you must 
>>>>>>>>>>>>> neither take any
>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. 
>>>>>>>>>>>>> Please
>>>>>>>>>>>>> contact the sender if you believe you have received this email in 
>>>>>>>>>>>>> error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes
>>>>>>>>>>>>>> or parallel connectors.  Overall database performance is the 
>>>>>>>>>>>>>> limiting
>>>>>>>>>>>>>> factor.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2)
>>>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and 
>>>>>>>>>>>>>> with modern
>>>>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd 
>>>>>>>>>>>>>> say it was
>>>>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided 
>>>>>>>>>>>>>> that you use
>>>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to know what is the maximum number of documents
>>>>>>>>>>>>>>> that you managed to crawl with ManifoldCF and with how many 
>>>>>>>>>>>>>>> connectors in
>>>>>>>>>>>>>>> parallel it could works ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Reply via email to