Hi Paul,

The patches I included should work (at worst with some code offsets) against
1.6.1 code without problems. I suggest you try that first before going
through an upgrade to 1.7.

Thanks,
Karl

On Thu, Sep 11, 2014 at 11:47 AM, Karl Wright <[email protected]> wrote:

> Hi Paul,
>
> Upgrade is included automatically in MCF.
>
> If you are using the multiprocess example, just run the initialize script
> and it will do the upgrade. The single process example does any required
> upgrades every time you start it.
>
> Thanks,
> Karl
>
> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <[email protected]> wrote:
>
>> Thanks - we've pulled down the branch and will test the changes. It looks
>> like a branch of 1.7 so it's going to take us a little while to test. We
>> need to migrate our connectors (there's some deprecated stuff that's now
>> been cleared in 1.7, e.g. getShareACL) and we'll need to patch the database
>> to include the pipeline and any other schema changes. We'll have some
>> environment contention over the next week as our performance test
>> environment needs to remain on 1.6.1 while we test a release. Once that's
>> clear I'll move to 1.7.
>>
>> On the database schema patch moving from 1.6.1 to 1.7 - is there a simple
>> way to migrate an existing database?
>>
>> Thanks,
>>
>> Paul
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <[email protected]> wrote:
>>
>>> Thanks -- I'll include that change as well then, in ticket
>>> CONNECTORS-1027.
>>>
>>> Karl
>>>
>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> That comes back immediately with 10001 rows:
>>>>
>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t;
>>>>
>>>>                              QUERY PLAN
>>>> ----------------------------------------------------------------------
>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual time=9.125..9.125 rows=1 loops=1)
>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual time=0.033..6.945 rows=10001 loops=1)
>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue  (cost=0.00..431189.31 rows=10290271 width=0) (actual time=0.031..3.257 rows=10001 loops=1)
>>>>                Heap Fetches: 725
>>>>  Total runtime: 9.157 ms
>>>> (5 rows)
>>>>
>>>> Whereas:
>>>>
>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>
>>>>                              QUERY PLAN
>>>> ----------------------------------------------------------------------
>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual time=5225.107..5225.109 rows=1 loops=1)
>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue  (cost=0.00..431197.31 rows=10290271 width=0) (actual time=0.108..3090.848 rows=10370209 loops=1)
>>>>                Heap Fetches: 684297
>>>>  Total runtime: 5225.151 ms
>>>>
>>>> Thanks,
>>>>
>>>> Paul
>>>>
>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Could you try this query on your database please and tell me if it
>>>>> executes promptly:
>>>>>
>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>
>>>>> I vaguely remember that I had to change the form of this query in
>>>>> order to support MySQL -- but first let's see if this helps.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for
>>>>>> large-scale Postgresql optimization work.
>>>>>>
>>>>>> Please note that trunk code already has schema changes relative to
>>>>>> MCF 1.7, so you will not be able to work directly with this branch code.
>>>>>> I'll have to create patches for whatever changes you would need to try.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the
>>>>>>> doc.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> Can you include the logged plan for this query; this is an actual
>>>>>>>> query encountered during crawling:
>>>>>>>>
>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>
>>>>>>>> These queries are all from the UI; it is what gets generated when
>>>>>>>> no limits are in place:
>>>>>>>>
>>>>>>>> WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>> WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>
>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>
>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM jobqueue t1 LIMIT 1000001]
>>>>>>>>
>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>> sequential scan of the entire table when given a limit clause. It
>>>>>>>> certainly didn't used to do that. If you have any other suggestions
>>>>>>>> please let me know.
>>>>>>>>
>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>> reindexes tables. For example, this query goes only against the
>>>>>>>> (small) jobs table. Its poor performance on occasion is likely due to
>>>>>>>> something else happening to the database, probably a reindex:
>>>>>>>>
>>>>>>>> WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR UPDATE]
>>>>>>>>
>>>>>>>> The final query is the document stuffing query, which is perhaps
>>>>>>>> the most critical query in the whole system:
>>>>>>>>
>>>>>>>> SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>>>>>>> FROM jobqueue t0
>>>>>>>> WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime <= 1407246846166
>>>>>>>>   AND EXISTS (
>>>>>>>>     SELECT 'x' FROM jobs t1
>>>>>>>>     WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5
>>>>>>>>   )
>>>>>>>>   AND NOT EXISTS (
>>>>>>>>     SELECT 'x' FROM jobqueue t2
>>>>>>>>     WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>   )
>>>>>>>>   AND NOT EXISTS (
>>>>>>>>     SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>     WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>   )
>>>>>>>> ORDER BY t0.docpriority ASC
>>>>>>>> LIMIT 480;
>>>>>>>>
>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>> experiments I did on postgresql 8.7 which showed no difference. What
>>>>>>>> Postgresql version are you using? Also, I trust you have query plans
>>>>>>>> that demonstrate your claim? In any case, whether IN vs. OR is generated
>>>>>>>> is a function of the MCF database driver, so this is trivial to
>>>>>>>> experiment with. I'll create a ticket and a branch for experimentation.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't
>>>>>>>>> seem to buy us that much on the table scan - in the attached you'll
>>>>>>>>> see that it's still taking a long time to return the job status page.
>>>>>>>>> Also in the attached are some other sample long-running queries
>>>>>>>>> that we're beginning to see more frequently. There's also an example
>>>>>>>>> of a query that's frequently executed and regularly takes > 4 secs
>>>>>>>>> (plus a suggested change to improve performance). This one in
>>>>>>>>> particular would certainly benefit from a change to SSDs, which should
>>>>>>>>> relieve the I/O-bound bottleneck on postgres.
>>>>>>>>>
>>>>>>>>> We're loading the system from 10mil towards 100mil so would be
>>>>>>>>> keen to work with you to optimise where possible.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> The jobstatus query that uses count(*) should be doing something
>>>>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>
>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>
>>>>>>>>>> This will still do a sequential scan, but it will be an aborted
>>>>>>>>>> one, so you can control the maximum amount of time spent doing the
>>>>>>>>>> query.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've
>>>>>>>>>>> missed something.
>>>>>>>>>>>
>>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>>> together some samples. As an example, on a job that's currently
>>>>>>>>>>> aborting:
>>>>>>>>>>>
>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0: '1.000000001E9'
>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1: '1407144048075'
>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916 width=0)
>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats: n_distinct=4.0 most_common_vals={G,C,Z,P} most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you
>>>>>>>>>>>> can set which limits the number of documents counted to at most a
>>>>>>>>>>>> specified amount. This uses a limit clause, which should prevent
>>>>>>>>>>>> unbounded time doing these kinds of queries:
>>>>>>>>>>>>
>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>
>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>> parameter is 10000, which however is incorrect. The actual true
>>>>>>>>>>>> default is 500000. You could set that lower for better UI
>>>>>>>>>>>> performance (losing some information, of course).
>>>>>>>>>>>>
>>>>>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>>>>>> spent in MCF to ensure that this doesn't happen. Specifically,
>>>>>>>>>>>> the main document queuing query is structured to read directly out
>>>>>>>>>>>> of a specific jobqueue index. This is the crucial query that must
>>>>>>>>>>>> work properly for scalability, since doing a query that is
>>>>>>>>>>>> effectively just a sort on the entire jobqueue would be a major
>>>>>>>>>>>> problem.
>>>>>>>>>>>> There are some times where Postgresql's optimizer fails to do
>>>>>>>>>>>> the right thing here, mostly because it makes a huge distinction
>>>>>>>>>>>> between whether there's zero of something or one of something, but
>>>>>>>>>>>> you can work around that particular issue by setting the analyze
>>>>>>>>>>>> count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing
>>>>>>>>>>>> query.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in your
>>>>>>>>>>>> case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>> million. At that point, even with good postgres vacuuming, the
>>>>>>>>>>>>> jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example, select count(*) from jobqueue, which is executed
>>>>>>>>>>>>> when querying job status, will do a full table scan of jobqueue,
>>>>>>>>>>>>> which has more than 10 million rows. That's going to take some
>>>>>>>>>>>>> time in postgres.
>>>>>>>>>>>>>
>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>> processing throughput (which we see is largely I/O bound in
>>>>>>>>>>>>> postgres) but we are increasingly seeing long running queries
>>>>>>>>>>>>> in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially,
>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per queue
>>>>>>>>>>>>> for example).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes
>>>>>>>>>>>>>> or parallel connectors. Overall database performance is the
>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2)
>>>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>>>> documents, using Postgresql 8.3.
>>>>>>>>>>>>>> Postgresql 9.x is faster, and with modern SSDs, I expect that
>>>>>>>>>>>>>> you will do even better. In general, I'd say it was fine to shoot
>>>>>>>>>>>>>> for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to know what is the maximum number of documents
>>>>>>>>>>>>>>> that you managed to crawl with ManifoldCF, and with how many
>>>>>>>>>>>>>>> connectors in parallel it could work?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Baptiste
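
For anyone reading this thread later, here is a minimal sketch of why the two count queries quoted in it behave so differently. It is only an illustration: it uses Python's built-in sqlite3 module instead of PostgreSQL so that it is self-contained, the jobqueue table is a toy stand-in with two columns, and the function name count_queries is made up for this example. In the first form the LIMIT applies to the single row that count(*) produces, so it cannot cap the number of rows scanned; in the second form the LIMIT sits inside the subquery, so at most that many rows are ever counted.

```python
import sqlite3


def count_queries(total_rows=20000, cap=10001):
    """Return (count with LIMIT outside, count with LIMIT inside a subquery)."""
    conn = sqlite3.connect(":memory:")
    # Toy stand-in for the real jobqueue table (assumption: two columns only).
    conn.execute("CREATE TABLE jobqueue (id INTEGER PRIMARY KEY, dochash TEXT)")
    conn.executemany(
        "INSERT INTO jobqueue (id, dochash) VALUES (?, ?)",
        ((i, "hash%d" % i) for i in range(total_rows)),
    )

    # LIMIT is applied after aggregation: count(*) yields exactly one row,
    # so the limit never restricts how many table rows are counted.
    outer = conn.execute(
        "SELECT count(*) FROM jobqueue LIMIT ?", (cap,)
    ).fetchone()[0]

    # LIMIT is applied before aggregation: the subquery stops after `cap`
    # rows and count(*) only sees those.
    inner = conn.execute(
        "SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT ?) t", (cap,)
    ).fetchone()[0]

    conn.close()
    return outer, inner


if __name__ == "__main__":
    outer, inner = count_queries()
    print("LIMIT outside the aggregate:", outer)
    print("LIMIT inside the subquery:  ", inner)
```

With 20000 rows and a cap of 10001, the first query reports the full 20000 while the second reports 10001, which matches the shape of the two plans Paul posted (Limit above Aggregate versus Aggregate above Limit). Timings on a real PostgreSQL instance will of course depend on the data and the planner.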
