A longer (63-minute) PostgreSQL load test has just completed, also without any
problems.

Are you sure you included the patch code correctly?

Karl


On Fri, Sep 12, 2014 at 12:51 PM, Karl Wright <[email protected]> wrote:

> It's done actually; my new laptop with PostgreSQL 9.3 can do 111111
> documents in roughly 15 minutes.  No problems encountered; roughly 120
> documents per second.
>
> For sanity's sake, could you try the following:
>
> - check out or unpack 1.6.1 sources
> - lay down downloaded lib dependencies
> - build using "ant build"
> - modify properties.xml and start the appropriate example
>
> Please see if you have any problems doing this process *without* any
> patches.
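>
> In shell terms, the sequence is roughly this (a sketch; the archive and path
> names are illustrative, so adjust for your layout):
>
> tar xzf apache-manifoldcf-1.6.1-src.tar.gz   # or check out the 1.6.1 tag
> cd apache-manifoldcf-1.6.1
> # lay down the downloaded lib dependencies per the release instructions
> ant build
> # edit properties.xml for your database and synchronization setup,
> # then start the appropriate example under dist/example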
>
> Also, what kind of synchronization are you using?  File-based, ZooKeeper,
> or single-process?
>
> Thanks,
> Karl
>
>
> On Fri, Sep 12, 2014 at 12:29 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Paul,
>>
>> The query looks right; the database driver determines the maximum number
>> of clauses in an OR list, just as it does for an IN() list.  In the case
>> of PostgreSQL and OR, the limit is 25; for IN() it's 100.
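>>
>> To make the two shapes concrete, a rough sketch (the clause counts here are
>> arbitrary rather than the real 25/100 limits; the column names are borrowed
>> from the log quoted further down):
>>
>> SELECT id FROM ingeststatus WHERE dockey IN (?,?,?) AND connectionname=?
>> SELECT id FROM ingeststatus WHERE (dockey=? OR dockey=? OR dockey=?) AND connectionname=?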
>>
>> The standard integration tests generally run small jobs but that is
>> typically sufficient to find query generation problems.  I have load tests
>> I can also run but they take several hours to complete.  I'll start one
>> now, but I may need to abort it before it finishes.
>>
>> Karl
>>
>>
>> On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm looking through the logs - I can see the change from IN to OR in each
>>> query - and there's clearly a difference in execution path, but the output
>>> is quite verbose so it will take a while.
>>>
>>> It may well be that document state has not been reprioritised or is in
>>> some way inconsistent. However, I don't think that is what's causing the
>>> issue - I can switch this behaviour on and off by changing the
>>> DBInterfacePostgreSQL class and restarting ManifoldCF. That suggests a
>>> query isn't behaving the same way between IN and OR - I just can't isolate
>>> the particular query (yet).
>>>
>>> Have you tested with a job already in running state (on a restart) with
>>> a large document count? For example, I am seeing this kind of thing, which
>>> looks messy but appears to execute as you'd expect:
>>>
>>> SELECT
>>> id,dockey,lastversion,lastoutputversion,authorityname,forcedparams FROM
>>> ingeststatus WHERE  (dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND
>>> connectionname=?]
>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 0:
>>> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A'
>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 1:
>>> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'
>>>
>>>
>>> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> The tests in fact do multiple complete crawls, so it is extremely
>>>> unlikely that the stuffer query is broken.  If you look at the queries
>>>> generated, you should note that the only difference is that whenever an xxx
>>>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead.
>>>> These should be completely equivalent; if they don't look equivalent to you
>>>> in the log, then I will fix whatever is broken.  I'll make sure here that
>>>> the queries look right visually too.
>>>>
>>>> One possibility is that when you restarted the agents process, the
>>>> jobqueue records did not yet finish getting reprioritized.  Stuffer queries
>>>> are fired all the time, but the running jobs must complete reprioritization
>>>> before the stuffer query will pick up any records.  I wonder if they may
>>>> not have managed to get to the right state before you aborted the
>>>> experiment?  You can tell what is happening by using jstack to get a thread
>>>> dump of the agents process.
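>>>>
>>>> For example, with standard JDK tooling (the grep pattern and file name are
>>>> just illustrative):
>>>>
>>>> jps -l | grep -i manifoldcf        # locate the agents process pid
>>>> jstack <pid> > agents-threads.txt  # capture the thread dump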
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <[email protected]
>>>> > wrote:
>>>>
>>>>> I stayed with base 1.6.1 and manually patched the code to include the
>>>>> two new methods in DBInterfacePostgreSQL.
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> The changes pass all tests here.  Is it possible that you attempted
>>>>>> some upgrade that failed (or didn't attempt upgrade but went to a new 
>>>>>> code
>>>>>> version)?
>>>>>>
>>>>>> If you could let me know as exactly as possible what you did, I can
>>>>>> let you know if that should have worked or not.
>>>>>>
>>>>>> Thanks!
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Karl,
>>>>>>>
>>>>>>> We appear to be seeing an issue with the performance change to use an
>>>>>>> OR clause rather than IN. After making the change, when we restart
>>>>>>> ManifoldCF (with one job in running state), documents in the running job
>>>>>>> are not picked up for processing by the stuffer thread. If we redeploy
>>>>>>> base 1.6.1 and restart, documents are processed. This is consistently
>>>>>>> switchable depending on which code base is deployed.
>>>>>>>
>>>>>>> We have logs that I can upload to the ticket if you recommend that we
>>>>>>> reopen the issue (or create a new one).
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Paul --
>>>>>>>>
>>>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>>>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x 
>>>>>>>> release
>>>>>>>> to 2.0.  That's why I said to use the patches, and try to stay on 
>>>>>>>> 1.6.1 or
>>>>>>>> at most to migrate to 1.7.
>>>>>>>>
>>>>>>>> IF you ALREADY tried an upgrade with the branch code, then you
>>>>>>>> would have wound up in a schema state where the schema had more 
>>>>>>>> columns in
>>>>>>>> it than the branch knew how to deal with.  That's bad, and you will 
>>>>>>>> need to
>>>>>>>> do things to fix the situation.  I believe you should still be able to 
>>>>>>>> do
>>>>>>>> the following:
>>>>>>>>
>>>>>>>> - Download 1.7 source, or check out
>>>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>>>>>> - Apply the patches
>>>>>>>> - Build
>>>>>>>> - Modify your properties.xml to point to your postgresql instance
>>>>>>>> - Run the upgrade (initialize.bat on the multi-process example, or
>>>>>>>> start the single-process example)
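>>>>>>>>
>>>>>>>> In shell terms, roughly (the patch file name below is a placeholder for
>>>>>>>> whatever ends up attached to CONNECTORS-1027):
>>>>>>>>
>>>>>>>> svn co https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch mcf-1.7
>>>>>>>> cd mcf-1.7
>>>>>>>> patch -p0 < CONNECTORS-1027.patch   # placeholder file name
>>>>>>>> ant build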
>>>>>>>>
>>>>>>>> You should then have a working 1.7 release, with code patches
>>>>>>>> applied.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks - we've pulled down the branch and will test the changes.
>>>>>>>>> It looks like a branch of 1.7, so it's going to take us a little while
>>>>>>>>> to test. We need to migrate our connectors (there's some deprecated
>>>>>>>>> stuff that's now been removed in 1.7, e.g. getShareACL) and we'll need
>>>>>>>>> to patch the database to include the pipeline and any other schema
>>>>>>>>> changes. We'll have some environment contention over the next week, as
>>>>>>>>> our performance test environment needs to remain on 1.6.1 while we test
>>>>>>>>> a release. Once that's clear I'll move to 1.7.
>>>>>>>>>
>>>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a
>>>>>>>>> simple way to migrate an existing database?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>>>>>> CONNECTORS-1027.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>>>>>
>>>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t;
>>>>>>>>>>>
>>>>>>>>>>>                               QUERY PLAN
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual time=9.125..9.125 rows=1 loops=1)
>>>>>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual time=0.033..6.945 rows=10001 loops=1)
>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue  (cost=0.00..431189.31 rows=10290271 width=0) (actual time=0.031..3.257 rows=10001 loops=1)
>>>>>>>>>>>                Heap Fetches: 725
>>>>>>>>>>>  Total runtime: 9.157 ms
>>>>>>>>>>> (5 rows)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Whereas:
>>>>>>>>>>>
>>>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>>>>>
>>>>>>>>>>>                               QUERY PLAN
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue  (cost=0.00..431197.31 rows=10290271 width=0) (actual time=0.108..3090.848 rows=10370209 loops=1)
>>>>>>>>>>>                Heap Fetches: 684297
>>>>>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> Could you try this query on your database please and tell me if
>>>>>>>>>>>> it executes promptly:
>>>>>>>>>>>>
>>>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I vaguely remember that I had to change the form of this query
>>>>>>>>>>>> in order to support MySQL -- but first let's see if this helps.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based
>>>>>>>>>>>>> branch (branches/CONNECTORS-1027) for looking at any changes we 
>>>>>>>>>>>>> do for
>>>>>>>>>>>>> large-scale Postgresql optimization work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please note that trunk code already has schema changes
>>>>>>>>>>>>> relative to MCF 1.7, so you will not be able to work directly 
>>>>>>>>>>>>> with this
>>>>>>>>>>>>> branch code.  I'll have to create patches for whatever changes 
>>>>>>>>>>>>> you would
>>>>>>>>>>>>> need to try.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them
>>>>>>>>>>>>>> to the doc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>>>>>>> long-running query (596499 ms): [SELECT 
>>>>>>>>>>>>>>> t0.id,t0.dochash,t0.docid
>>>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND 
>>>>>>>>>>>>>>> t1.parentidhash=? AND
>>>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These queries are all from the UI; they are what gets generated when no
>>>>>>>>>>>>>>> limits are in place:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found
>>>>>>>>>>>>>>> a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) 
>>>>>>>>>>>>>>> AS doccount
>>>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found
>>>>>>>>>>>>>>> a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) 
>>>>>>>>>>>>>>> AS doccount
>>>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found
>>>>>>>>>>>>>>> a long-running query (254851 ms): [SELECT COUNT(dochash) AS 
>>>>>>>>>>>>>>> doccount FROM
>>>>>>>>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause.  
>>>>>>>>>>>>>>> It
>>>>>>>>>>>>>>> certainly didn't use to do that.  If you have any other suggestions please
>>>>>>>>>>>>>>> suggestions please
>>>>>>>>>>>>>>> let me know.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>>>>>> reindexes tables.  For example, this query goes only against 
>>>>>>>>>>>>>>> the (small)
>>>>>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to 
>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE 
>>>>>>>>>>>>>>> status IN
>>>>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The final query is the document stuffing query, which is
>>>>>>>>>>>>>>> perhaps the most critical query in the whole system:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,
>>>>>>>>>>>>>>>    t0.failcount,t0.priorityset
>>>>>>>>>>>>>>>  FROM jobqueue t0
>>>>>>>>>>>>>>>  WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND
>>>>>>>>>>>>>>>    t0.checktime <= 1407246846166
>>>>>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>>>>>    WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5
>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d')
>>>>>>>>>>>>>>>      AND t2.jobid!=t0.jobid
>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference. 
>>>>>>>>>>>>>>>  What
>>>>>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query 
>>>>>>>>>>>>>>> plans that
>>>>>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is 
>>>>>>>>>>>>>>> generated is a
>>>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to 
>>>>>>>>>>>>>>> experiment
>>>>>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Changing maxstatuscount to something much smaller (10,000) doesn't seem
>>>>>>>>>>>>>>>> to buy us that much on the table scan - in the attached you'll see that
>>>>>>>>>>>>>>>> it's still taking a long time to return the job status page. Also
>>>>>>>>>>>>>>>> attached are some other sample long-running queries that we're beginning
>>>>>>>>>>>>>>>> to see more frequently. There's also an example of a query that's
>>>>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested
>>>>>>>>>>>>>>>> change to improve performance). This one in particular would certainly
>>>>>>>>>>>>>>>> benefit from a change to SSDs, which should relieve the I/O-bound
>>>>>>>>>>>>>>>> bottleneck on Postgres.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're loading the system from 10 million towards 100 million documents,
>>>>>>>>>>>>>>>> so we'd be keen to work with you to optimise where possible.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing something like
>>>>>>>>>>>>>>>>> this when the maxstatuscount value is set:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time 
>>>>>>>>>>>>>>>>> spent doing the
>>>>>>>>>>>>>>>>> query.
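>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> A sketch of the bounded form (per the EXPLAIN ANALYZE comparison higher
>>>>>>>>>>>>>>>>> up in this thread, the LIMIT only caps the scan once it is pushed into
>>>>>>>>>>>>>>>>> a subselect; the WHERE placeholder is illustrative):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue WHERE xxx LIMIT 500001) t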
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it
>>>>>>>>>>>>>>>>>> from count(*)-ing but I'll certainly have another look to 
>>>>>>>>>>>>>>>>>> see if we've
>>>>>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We're increasingly seeing long-running queries, and I'll put together
>>>>>>>>>>>>>>>>>> some samples. As an example, on a job that's currently aborting:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1: '1407144048075'
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916 width=0)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats: n_distinct=4.0 most_common_vals={G,C,Z,P} most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter
>>>>>>>>>>>>>>>>>>> you can set which limits the number of documents counted to 
>>>>>>>>>>>>>>>>>>> at most a
>>>>>>>>>>>>>>>>>>> specified amount.  This uses a limit clause, which should 
>>>>>>>>>>>>>>>>>>> prevent unbounded
>>>>>>>>>>>>>>>>>>> time doing these kinds of queries:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The documentation says that the default value for this parameter is
>>>>>>>>>>>>>>>>>>> 10000, which is incorrect; the actual default is 500000.  You could
>>>>>>>>>>>>>>>>>>> set it lower for better UI performance (losing some information, of
>>>>>>>>>>>>>>>>>>> course).
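>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For example, in properties.xml (100000 here is just an illustrative
>>>>>>>>>>>>>>>>>>> value):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <property name="org.apache.manifoldcf.ui.maxstatuscount" value="100000"/>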
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort
>>>>>>>>>>>>>>>>>>> has been spent in MCF to ensure that this doesn't happen.  Specifically,
>>>>>>>>>>>>>>>>>>> Specifically,
>>>>>>>>>>>>>>>>>>> the main document queuing query is structured to read 
>>>>>>>>>>>>>>>>>>> directly out of a
>>>>>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that 
>>>>>>>>>>>>>>>>>>> must work properly
>>>>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively 
>>>>>>>>>>>>>>>>>>> just a sort on the
>>>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some 
>>>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are times when
>>>>>>>>>>>>>>>>>>> PostgreSQL's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>>>>>>>> mostly because it
>>>>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of 
>>>>>>>>>>>>>>>>>>> something or one
>>>>>>>>>>>>>>>>>>> of something, but you can work around that particular issue 
>>>>>>>>>>>>>>>>>>> by setting the
>>>>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- 
>>>>>>>>>>>>>>>>>>> which basically
>>>>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every 
>>>>>>>>>>>>>>>>>>> stuffing query.
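>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If memory serves, that analyze count is a per-table property of roughly
>>>>>>>>>>>>>>>>>>> this shape -- treat the exact name as an assumption and check it against
>>>>>>>>>>>>>>>>>>> the performance-tuning page before relying on it:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <property name="org.apache.manifoldcf.db.postgres.analyze.jobqueue" value="1"/>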
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running
>>>>>>>>>>>>>>>>>>> in your case so that I can see if that is what you are 
>>>>>>>>>>>>>>>>>>> encountering or not.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count of more than
>>>>>>>>>>>>>>>>>>>> 10 million. At that point, even with good Postgres vacuuming, the
>>>>>>>>>>>>>>>>>>>> jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For example, select count(*) from jobqueue, which is executed when
>>>>>>>>>>>>>>>>>>>> querying job status, will do a full table scan of jobqueue, which has
>>>>>>>>>>>>>>>>>>>> more than 10 million rows. That's going to take some time in Postgres.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document processing
>>>>>>>>>>>>>>>>>>>> throughput (which we see is largely I/O bound in Postgres), but we are
>>>>>>>>>>>>>>>>>>>> increasingly seeing long-running queries in the logs. Our current
>>>>>>>>>>>>>>>>>>>> thinking is that we'll need to refactor JobQueue somewhat to optimise
>>>>>>>>>>>>>>>>>>>> queries and potentially partition jobqueue into a set of tables (one
>>>>>>>>>>>>>>>>>>>> table per queue, for example).
>>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database 
>>>>>>>>>>>>>>>>>>>>> performance is the
>>>>>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF in Action (I believe Chapter 2)
>>>>>>>>>>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million
>>>>>>>>>>>>>>>>>>>>> web documents, using Postgresql 8.3.  Postgresql 9.x is 
>>>>>>>>>>>>>>>>>>>>> faster, and with
>>>>>>>>>>>>>>>>>>>>> modern SSD's, I expect that you will do even better.  In 
>>>>>>>>>>>>>>>>>>>>> general, I'd say
>>>>>>>>>>>>>>>>>>>>> it was fine to shoot for 10M - 100M documents on 
>>>>>>>>>>>>>>>>>>>>> ManifoldCF, provided that
>>>>>>>>>>>>>>>>>>>>> you use a good database, and provided that you maintain 
>>>>>>>>>>>>>>>>>>>>> it properly.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I would like to know the maximum number of documents that you have
>>>>>>>>>>>>>>>>>>>>>> managed to crawl with ManifoldCF, and how many connectors in parallel
>>>>>>>>>>>>>>>>>>>>>> it can work with.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
