Hi Paul, Zookeeper is fine; file-system multiprocess synchronization, though, was just broken by a recent Windows update, so don't use that.
Karl On Fri, Sep 12, 2014 at 2:34 PM, Paul Boichat <[email protected]> wrote: > Hi, > > Will do. Am also putting in some extra debug to help narrow it down. > > We're using zookeeper multiprocess. > > Paul > > > > VP Engineering, > Exonar Ltd > > T: +44 7940 567724 > > twitter:@exonarco @pboichat > W: http://www.exonar.com > Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/> > > Exonar Limited, registered in the UK, registration number 06439969 at 14 > West Mills, Newbury, Berkshire, RG14 5HG > DISCLAIMER: This email and any attachments to it may be confidential and > are intended solely for the use of the individual to whom it is addressed. > Any views or opinions expressed are solely those of the author and do not > necessarily represent those of Exonar Ltd. If you are not the intended > recipient of this email, you must neither take any action based upon its > contents, nor copy or show it to anyone. Please contact the sender if you > believe you have received this email in error. > > On Fri, Sep 12, 2014 at 5:51 PM, Karl Wright <[email protected]> wrote: > >> It's done actually; my new laptop with PostgreSQL 9.3 can do 111111 >> documents in roughly 15 minutes. No problems encountered; roughly 120 >> documents per second. >> >> For sanity's sake, could you try the following: >> >> - check out or unpack 1.6.1 sources >> - lay down downloaded lib dependencies >> - build using "ant build" >> - modify properties.xml and start the appropriate example >> >> Please see if you have any problems following this process *without* any >> patches. >> >> Also, what kind of synchronization are you using? File based, zookeeper, >> or single-process? >> >> Thanks, >> Karl >> >> >> On Fri, Sep 12, 2014 at 12:29 PM, Karl Wright <[email protected]> wrote: >> >>> Hi Paul, >>> >>> The query looks right; the database driver determines the maximum number >>> of clauses in an OR list, just like it does for an IN() list.
>>> In the case of Postgresql and OR, the limit is 25; for IN()'s it's 100. >>> >>> The standard integration tests generally run small jobs but that is >>> typically sufficient to find query generation problems. I have load tests >>> I can also run but they take several hours to complete. I'll start one >>> now, but I may need to abort it before it finishes. >>> >>> Karl >>> >>> >>> On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <[email protected]> >>> wrote: >>> >>>> Hi, >>>> >>>> I'm looking through the logs - can see the change from IN to OR in each >>>> query - and there's clearly a difference in execution path but it's quite >>>> verbose so will take a while. >>>> >>>> It may well be that document state has not been reprioritised or is in >>>> some way inconsistent. However, I don't think it's that which is causing >>>> the issue - I can switch this behaviour on and off by changing the >>>> DBInterfacePostgreSQL class and restarting Manifold. That seems to suggest a >>>> query isn't behaving the same way between IN and OR - I just can't isolate >>>> the particular query (yet). >>>> >>>> Have you tested with a job already in running state (on a restart) with >>>> a large document count? For example, I am seeing this kind of thing, which >>>> looks messy but appears to execute as you'd expect: >>>> >>>> SELECT >>>> id,dockey,lastversion,lastoutputversion,authorityname,forcedparams FROM >>>> ingeststatus WHERE (dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? 
OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND >>>> connectionname=?] >>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) - Parameter 0: >>>> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A' >>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) - Parameter 1: >>>> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'
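As a minimal sketch of the clause generation discussed in this part of the thread (an xxx IN(?,?) list versus an equivalent (xxx=? OR xxx=?) list, each split according to a driver-specific maximum), the Python below builds both forms. It is not ManifoldCF code: the function name build_match_clauses and the two constants are hypothetical, and the limits of 25 and 100 are simply the PostgreSQL figures quoted in the thread.

```python
from typing import List, Sequence, Tuple

# Limits quoted in this thread for PostgreSQL; treated here as illustrative constants.
MAX_OR_CLAUSES = 25
MAX_IN_CLAUSES = 100


def build_match_clauses(column: str, values: Sequence[str], use_or: bool) -> List[Tuple[str, List[str]]]:
    """Return one (sql_fragment, params) pair per chunk of values."""
    limit = MAX_OR_CLAUSES if use_or else MAX_IN_CLAUSES
    chunks: List[Tuple[str, List[str]]] = []
    for start in range(0, len(values), limit):
        params = list(values[start:start + limit])
        if use_or:
            # (col=? OR col=? ...) form
            fragment = "(" + " OR ".join(f"{column}=?" for _ in params) + ")"
        else:
            # col IN (?,?,...) form
            fragment = f"{column} IN (" + ",".join("?" for _ in params) + ")"
        chunks.append((fragment, params))
    return chunks


# Hypothetical keys, only to show how 60 values are split under the OR limit.
keys = [f"key{i}" for i in range(60)]
for fragment, params in build_match_clauses("dockey", keys, use_or=True):
    print(len(params), "parameters:", fragment[:32] + "...")
```

Under these assumptions each fragment would be combined with the rest of the WHERE clause (for example AND connectionname=?) and run once per chunk, so the two forms should select the same rows and differ only in how many values fit into one statement.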
>>>> >>>> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <[email protected]> >>>> wrote: >>>> >>>>> Hi Paul, >>>>> >>>>> The tests in fact do multiple complete crawls, so it is extremely >>>>> unlikely that the stuffer query is broken. If you look at the queries >>>>> generated, you should note that the only difference is that whenever an >>>>> xxx >>>>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead. >>>>> These should be completely equivalent; if they don't look equivalent to >>>>> you >>>>> in the log, then I will fix whatever is broken. I'll make sure here that >>>>> the queries look right visually too. >>>>> >>>>> One possibility is that when you restarted the agents process, the >>>>> jobqueue records had not yet finished getting reprioritized. Stuffer >>>>> queries >>>>> are fired all the time, but the running jobs must complete >>>>> reprioritization >>>>> before the stuffer query will pick up any records. I wonder if they may >>>>> not have managed to get to the right state before you aborted the >>>>> experiment? You can tell what is happening by using jstack to get a >>>>> thread >>>>> dump of the agents process. >>>>> >>>>> Thanks, >>>>> Karl >>>>> >>>>> >>>>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat < >>>>> [email protected]> wrote: >>>>> >>>>>> I stayed with base 1.6.1 and manually patched the code to include the >>>>>> two new methods in DBInterfacePostgreSQL. >>>>>> >>>>>> Paul >>>>>> >>>>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> The changes pass all tests here. Is it possible that you attempted >>>>>>> some upgrade that failed (or didn't attempt an upgrade but went to a new >>>>>>> code >>>>>>> version)? >>>>>>> >>>>>>> If you could let me know as exactly as possible what you did, I can >>>>>>> let you know if that should have worked or not. >>>>>>> >>>>>>> Thanks! >>>>>>> Karl >>>>>>> >>>>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Karl, >>>>>>>> >>>>>>>> We appear to be seeing an issue with the performance change to use >>>>>>>> an OR clause rather than IN. After making the change, when we restart >>>>>>>> manifoldcf (with one job in running state) documents in the running >>>>>>>> job are >>>>>>>> not picked up for processing by the stuffer thread. If we redeploy base >>>>>>>> 1.6.1 and restart, documents are processed. This is consistently >>>>>>>> switchable >>>>>>>> depending on which code base is deployed. >>>>>>>> >>>>>>>> We have logs that I could upload to the ticket if you recommend >>>>>>>> that we reopen the issue (or create a new one)? >>>>>>>> >>>>>>>> Paul >>>>>>>> >>>>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Paul -- >>>>>>>>> >>>>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of >>>>>>>>> trunk, which is MCF 2.0. MCF 2.0 is not backwards compatible with any >>>>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x >>>>>>>>> release >>>>>>>>> to 2.0. That's why I said to use the patches, and try to stay on >>>>>>>>> 1.6.1 or >>>>>>>>> at most to migrate to 1.7. >>>>>>>>> >>>>>>>>> IF you ALREADY tried an upgrade with the branch code, then you >>>>>>>>> would have wound up in a schema state where the schema had more >>>>>>>>> columns in >>>>>>>>> it than the branch knew how to deal with. That's bad, and you will >>>>>>>>> need to >>>>>>>>> do things to fix the situation.
I believe you should still be able >>>>>>>>> to do >>>>>>>>> the following: >>>>>>>>> >>>>>>>>> - Download 1.7 source, or check out >>>>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch >>>>>>>>> - Apply the patches >>>>>>>>> - Build >>>>>>>>> - Modify your properties.xml to point to your postgresql instance >>>>>>>>> - Run the upgrade (initialize.bat on the multi-process example, or >>>>>>>>> start the single-process example) >>>>>>>>> >>>>>>>>> You should then have a working 1.7 release, with code patches >>>>>>>>> applied. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Karl >>>>>>>>> >>>>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> Thanks - we've pulled down the branch and will test the changes. >>>>>>>>>> It looks like a branch of 1.7 so it's going to take us a little >>>>>>>>>> while to >>>>>>>>>> test. We need to migrate our connectors (there's some deprecated >>>>>>>>>> stuff >>>>>>>>>> that's now been cleared in 1.7, e.g. getShareACL) and we'll need to >>>>>>>>>> patch >>>>>>>>>> the database to include the pipeline and any other schema changes. >>>>>>>>>> We'll >>>>>>>>>> have some environment contention over the next week as our >>>>>>>>>> performance test >>>>>>>>>> environment needs to remain on 1.6.1 while we test a release. Once >>>>>>>>>> that's >>>>>>>>>> clear I'll move to 1.7. >>>>>>>>>> >>>>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there >>>>>>>>>> a simple way to migrate an existing database? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Paul >>>>>>>>>> >>>>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks -- I'll include that change as well then, in ticket >>>>>>>>>>> CONNECTORS-1027.
>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Karl >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> That comes back immediately with 10001 rows: >>>>>>>>>>>> >>>>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue >>>>>>>>>>>> LIMIT 10001) t; >>>>>>>>>>>> >>>>>>>>>>>> QUERY PLAN >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> ----------------------------------------------------------------------------------------------------------------------- >>>>>>>>>>>> ---------------------------------- >>>>>>>>>>>> Aggregate (cost=544.08..544.09 rows=1 width=0) (actual >>>>>>>>>>>> time=9.125..9.125 rows=1 loops=1) >>>>>>>>>>>> -> Limit (cost=0.00..419.07 rows=10001 width=0) (actual >>>>>>>>>>>> time=0.033..6.945 rows=10001 loops=1) >>>>>>>>>>>> -> Index Only Scan using jobqueue_pkey on jobqueue >>>>>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time >>>>>>>>>>>> =0.031..3.257 rows=10001 loops=1) >>>>>>>>>>>> Heap Fetches: 725 >>>>>>>>>>>> Total runtime: 9.157 ms >>>>>>>>>>>> (5 rows) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Whereas: >>>>>>>>>>>> >>>>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001; >>>>>>>>>>>> >>>>>>>>>>>> QUERY PLAN >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> ----------------------------------------------------------------------------------------------------------------------- >>>>>>>>>>>> ---------------------------------------- >>>>>>>>>>>> Limit (cost=456922.99..456923.00 rows=1 width=0) (actual >>>>>>>>>>>> time=5225.107..5225.109 rows=1 loops=1) >>>>>>>>>>>> -> Aggregate (cost=456922.99..456923.00 rows=1 width=0) >>>>>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1) >>>>>>>>>>>> -> Index Only Scan using jobqueue_pkey on jobqueue >>>>>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time >>>>>>>>>>>> =0.108..3090.848 rows=10370209 loops=1) >>>>>>>>>>>> Heap Fetches: 684297 
>>>>>>>>>>>> Total runtime: 5225.151 ms >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> >>>>>>>>>>>> Paul >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Paul, >>>>>>>>>>>>> >>>>>>>>>>>>> Could you try this query on your database please and tell me >>>>>>>>>>>>> if it executes promptly: >>>>>>>>>>>>> >>>>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I vaguely remember that I had to change the form of this query >>>>>>>>>>>>> in order to support MySQL -- but first let's see if this helps.
>>>>>>>>>>>>> >>>>>>>>>>>>> Karl >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based >>>>>>>>>>>>>> branch (branches/CONNECTORS-1027) for looking at any changes we >>>>>>>>>>>>>> do for >>>>>>>>>>>>>> large-scale Postgresql optimization work. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Please note that trunk code already has schema changes >>>>>>>>>>>>>> relative to MCF 1.7, so you will not be able to work directly >>>>>>>>>>>>>> with this >>>>>>>>>>>>>> branch code. I'll have to create patches for whatever changes >>>>>>>>>>>>>> you would >>>>>>>>>>>>>> need to try. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Karl >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them >>>>>>>>>>>>>>> to the doc. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Paul >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Paul, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Can you include the logged plan for this query; this is an >>>>>>>>>>>>>>>> actual query encountered during crawling: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a >>>>>>>>>>>>>>>> long-running query (596499 ms): [SELECT >>>>>>>>>>>>>>>> t0.id,t0.dochash,t0.docid >>>>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND >>>>>>>>>>>>>>>> t1.parentidhash=? AND >>>>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> These queries are all from the UI; it is what gets >>>>>>>>>>>>>>>> generated when no limits are in place: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - >>>>>>>>>>>>>>>> Found a long-running query (166845 ms): [SELECT >>>>>>>>>>>>>>>> jobid,COUNT(dochash) AS >>>>>>>>>>>>>>>> doccount FROM jobqueue t1 GROUP BY jobid] >>>>>>>>>>>>>>>> WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - >>>>>>>>>>>>>>>> Found a long-running query (107222 ms): [SELECT >>>>>>>>>>>>>>>> jobid,COUNT(dochash) AS >>>>>>>>>>>>>>>> doccount FROM jobqueue t1 GROUP BY jobid] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This query is from the UI with a limit of 1000000: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - >>>>>>>>>>>>>>>> Found a long-running query (254851 ms): [SELECT COUNT(dochash) >>>>>>>>>>>>>>>> AS doccount >>>>>>>>>>>>>>>> FROM jobqueue t1 LIMIT 1000001] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a >>>>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause. >>>>>>>>>>>>>>>> It >>>>>>>>>>>>>>>> certainly didn't used to do that. If you have any other >>>>>>>>>>>>>>>> suggestions please >>>>>>>>>>>>>>>> let me know. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Some queries show up in this list because MCF periodically >>>>>>>>>>>>>>>> reindexes tables. For example, this query goes only against >>>>>>>>>>>>>>>> the (small) >>>>>>>>>>>>>>>> jobs table. Its poor performance on occasion is likely due to >>>>>>>>>>>>>>>> something >>>>>>>>>>>>>>>> else happening to the database, probably a reindex: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a >>>>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE >>>>>>>>>>>>>>>> status IN >>>>>>>>>>>>>>>> (?,?,?,?,?) 
FOR UPDATE] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The final query is the document stuffing query, which is >>>>>>>>>>>>>>>> perhaps the most critical query in the whole system: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> SELECT >>>>>>>>>>>>>>>> t0.id >>>>>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount, >>>>>>>>>>>>>>>> t0.priorityset FROM jobqueue t0 >>>>>>>>>>>>>>>> WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND >>>>>>>>>>>>>>>> t0.checktime >>>>>>>>>>>>>>>> <= 1407246846166 >>>>>>>>>>>>>>>> AND EXISTS ( >>>>>>>>>>>>>>>> SELECT 'x' FROM jobs t1 >>>>>>>>>>>>>>>> WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND >>>>>>>>>>>>>>>> t1.priority=5 >>>>>>>>>>>>>>>> ) >>>>>>>>>>>>>>>> AND NOT EXISTS ( >>>>>>>>>>>>>>>> SELECT 'x' FROM jobqueue t2 >>>>>>>>>>>>>>>> WHERE t2.dochash=t0.dochash AND t2.status IN >>>>>>>>>>>>>>>> ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid >>>>>>>>>>>>>>>> ) >>>>>>>>>>>>>>>> AND NOT EXISTS ( >>>>>>>>>>>>>>>> SELECT 'x' FROM prereqevents t3,events t4 >>>>>>>>>>>>>>>> WHERE t0.id=t3.owner AND t3.eventname=t4.name >>>>>>>>>>>>>>>> ) >>>>>>>>>>>>>>>> ORDER BY t0.docpriority ASC >>>>>>>>>>>>>>>> LIMIT 480; >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with >>>>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no >>>>>>>>>>>>>>>> difference. What >>>>>>>>>>>>>>>> Postgresql version are you using? Also, I trust you have >>>>>>>>>>>>>>>> query plans that >>>>>>>>>>>>>>>> demonstrate your claim? In any case, whether IN vs. OR is >>>>>>>>>>>>>>>> generated is a >>>>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to >>>>>>>>>>>>>>>> experiment >>>>>>>>>>>>>>>> with. I'll create a ticket and a branch for experimentation. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Karl, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Changing maxstatuscount to something much smaller (10,000) >>>>>>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the >>>>>>>>>>>>>>>>> attached you'll >>>>>>>>>>>>>>>>> see that it's still taking a long time to return the job >>>>>>>>>>>>>>>>> status page. Also >>>>>>>>>>>>>>>>> in the attached are some other sample long-running queries >>>>>>>>>>>>>>>>> that we're >>>>>>>>>>>>>>>>> beginning to see more frequently. There's also an example of >>>>>>>>>>>>>>>>> a query that's >>>>>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a >>>>>>>>>>>>>>>>> suggested change >>>>>>>>>>>>>>>>> to improve performance). This one in particular would >>>>>>>>>>>>>>>>> certainly benefit >>>>>>>>>>>>>>>>> from a change to SSDs which should relieve the I/O-bound >>>>>>>>>>>>>>>>> bottleneck on >>>>>>>>>>>>>>>>> postgres. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so >>>>>>>>>>>>>>>>> would be keen to work with you to optimise where possible. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Paul >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Paul, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing >>>>>>>>>>>>>>>>>> something like this when the maxstatuscount value is >>>>>>>>>>>>>>>>>> set: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an >>>>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time >>>>>>>>>>>>>>>>>> spent doing the >>>>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop >>>>>>>>>>>>>>>>>>> it from count(*)-ing but I'll certainly have another look >>>>>>>>>>>>>>>>>>> to see if we've >>>>>>>>>>>>>>>>>>> missed something. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll >>>>>>>>>>>>>>>>>>> put together some samples. As an example, on a job that's >>>>>>>>>>>>>>>>>>> currently >>>>>>>>>>>>>>>>>>> aborting: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found >>>>>>>>>>>>>>>>>>> a long-running query (72902 ms): [UPDATE jobqueue SET >>>>>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?] >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - >>>>>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9' >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - >>>>>>>>>>>>>>>>>>> Parameter 1: '1407144048075' >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>>>>>>>>>>>>> Plan: Update on jobqueue (cost=18806.08..445770.39 >>>>>>>>>>>>>>>>>>> rows=764916 width=287) >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>>>>>>>>>>>>> Plan: -> Bitmap Heap Scan on jobqueue >>>>>>>>>>>>>>>>>>> (cost=18806.08..445770.39 >>>>>>>>>>>>>>>>>>> rows=764916 width=287) >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>>>>>>>>>>>>> Plan: Recheck Cond: (jobid = 1407144048075::bigint) >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>>>>>>>>>>>>> Plan: -> Bitmap Index Scan on i1392985450177 >>>>>>>>>>>>>>>>>>> (cost=0.00..18614.85 >>>>>>>>>>>>>>>>>>> rows=764916 width=0) >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 
(Job reset thread) - >>>>>>>>>>>>>>>>>>> Plan: Index Cond: (jobid = >>>>>>>>>>>>>>>>>>> 1407144048075::bigint) >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) - >>>>>>>>>>>>>>>>>>> Stats: n_distinct=4.0 most_common_vals={G,C,Z,P} >>>>>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665} >>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) - >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you can set
>>>>>>>>>>>>>>>>>>>> that caps the number of documents counted at a specified amount. This
>>>>>>>>>>>>>>>>>>>> uses a LIMIT clause, which should prevent these kinds of queries from
>>>>>>>>>>>>>>>>>>>> taking unbounded time:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The documentation says that the default value for this parameter is
>>>>>>>>>>>>>>>>>>>> 10000, but that is incorrect. The actual default is 500000. You could
>>>>>>>>>>>>>>>>>>>> set it lower for better UI performance (losing some information, of
>>>>>>>>>>>>>>>>>>>> course).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has been spent
>>>>>>>>>>>>>>>>>>>> in MCF to ensure that this doesn't happen. Specifically, the main
>>>>>>>>>>>>>>>>>>>> document queuing query is structured to read directly out of a
>>>>>>>>>>>>>>>>>>>> specific jobqueue index. This is the crucial query that must work
>>>>>>>>>>>>>>>>>>>> properly for scalability, since doing a query that is effectively
>>>>>>>>>>>>>>>>>>>> just a sort on the entire jobqueue would be a major problem. There
>>>>>>>>>>>>>>>>>>>> are times when PostgreSQL's optimizer fails to do the right thing
>>>>>>>>>>>>>>>>>>>> here, mostly because it makes a huge distinction between whether
>>>>>>>>>>>>>>>>>>>> there's zero of something or one of something, but you can work
>>>>>>>>>>>>>>>>>>>> around that particular issue by setting the analyze count to 1 if you
>>>>>>>>>>>>>>>>>>>> start to see this problem, which basically means that reanalysis of
>>>>>>>>>>>>>>>>>>>> the table has to occur on every stuffing query.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in your case
>>>>>>>>>>>>>>>>>>>> so that I can see whether that is what you are encountering.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10 million. At
>>>>>>>>>>>>>>>>>>>>> that point, even with good postgres vacuuming, the jobqueue table is
>>>>>>>>>>>>>>>>>>>>> starting to become a bottleneck.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> For example, select count(*) from jobqueue, which is executed when
>>>>>>>>>>>>>>>>>>>>> querying job status, will do a full table scan of jobqueue, which
>>>>>>>>>>>>>>>>>>>>> has more than 10 million rows. That's going to take some time in
>>>>>>>>>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document processing
>>>>>>>>>>>>>>>>>>>>> throughput (which we see is largely I/O bound in postgres), but we
>>>>>>>>>>>>>>>>>>>>> are increasingly seeing long-running queries in the logs. Our
>>>>>>>>>>>>>>>>>>>>> current thinking is that we'll need to refactor JobQueue somewhat to
>>>>>>>>>>>>>>>>>>>>> optimise queries and, potentially, partition jobqueue into a subset
>>>>>>>>>>>>>>>>>>>>> of tables (table per queue, for example).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>>>>>>>>>>>>>>>>>> parallel connectors. Overall database performance is the limiting
>>>>>>>>>>>>>>>>>>>>>> factor.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>>>>>>>>>>>>>>>>>>>> discusses this issue.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web documents
>>>>>>>>>>>>>>>>>>>>>> using PostgreSQL 8.3. PostgreSQL 9.x is faster, and with modern
>>>>>>>>>>>>>>>>>>>>>> SSDs I expect that you will do even better. In general, I'd say it
>>>>>>>>>>>>>>>>>>>>>> is fine to shoot for 10M - 100M documents on ManifoldCF, provided
>>>>>>>>>>>>>>>>>>>>>> that you use a good database and that you maintain it properly.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I would like to know the maximum number of documents that you have
>>>>>>>>>>>>>>>>>>>>>>> managed to crawl with ManifoldCF, and how many connectors it can
>>>>>>>>>>>>>>>>>>>>>>> work with in parallel.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Baptiste
