The changes pass all tests here. Is it possible that you attempted an upgrade that failed (or didn't attempt an upgrade but moved to a new code version)?
If you could let me know as exactly as possible what you did, I can let you know if that should have worked or not. Thanks!
Karl

On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <[email protected]> wrote:

> Karl,
>
> We appear to be seeing an issue with the performance change to use an OR clause rather than IN. After making the change, when we restart manifoldcf (with one job in running state), documents in the running job are not picked up for processing by the stuffer thread. If we redeploy base 1.6.1 and restart, documents are processed. This is consistently switchable depending on which code base is deployed.
>
> We have logs that I could upload to the ticket if you recommend that we reopen the issue (or create a new one)?
>
> Paul
>
> VP Engineering, Exonar Ltd
> T: +44 7940 567724
> twitter: @exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and are intended solely for the use of the individual to whom it is addressed. Any views or opinions expressed are solely those of the author and do not necessarily represent those of Exonar Ltd. If you are not the intended recipient of this email, you must neither take any action based upon its contents, nor copy or show it to anyone. Please contact the sender if you believe you have received this email in error.
>
> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Paul --
>>
>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of trunk, which is MCF 2.0. MCF 2.0 is not backwards compatible with any previous MCF release, and indeed there is no upgrade from any 1.x release to 2.0. That's why I said to use the patches, and try to stay on 1.6.1 or at most to migrate to 1.7.
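[Editor's note: for readers following the IN-vs-OR discussion in this thread, the two predicate forms are logically equivalent; only the query planner's treatment of them can differ. A minimal sketch of the equivalence, using Python's bundled SQLite rather than PostgreSQL (so it demonstrates the semantics, not the Postgres plan difference the thread is about; the table and values are simplified from the stuffing query quoted later):]

```python
import sqlite3

# Tiny jobqueue stand-in: ids 0..5 with single-letter status codes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobqueue (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO jobqueue VALUES (?, ?)",
                 list(enumerate("PGZCAP")))

# The IN form, as generated by MCF 1.6.1 ...
in_form = conn.execute(
    "SELECT id FROM jobqueue WHERE status IN ('P','G') ORDER BY id"
).fetchall()

# ... and the equivalent OR form from the experimental change.
or_form = conn.execute(
    "SELECT id FROM jobqueue WHERE status='P' OR status='G' ORDER BY id"
).fetchall()

print(in_form == or_form)  # both return rows 0, 1, 5
```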
>>
>> IF you ALREADY tried an upgrade with the branch code, then you would have wound up in a schema state where the schema had more columns in it than the branch knew how to deal with. That's bad, and you will need to do things to fix the situation. I believe you should still be able to do the following:
>>
>> - Download 1.7 source, or check out https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>> - Apply the patches
>> - Build
>> - Modify your properties.xml to point to your postgresql instance
>> - Run the upgrade (initialize.bat on the multi-process example, or start the single-process example)
>>
>> You should then have a working 1.7 release, with code patches applied.
>>
>> Thanks,
>> Karl
>>
>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <[email protected]> wrote:
>>
>>> Thanks - we've pulled down the branch and will test the changes. It looks like a branch of 1.7 so it's going to take us a little while to test. We need to migrate our connectors (there's some deprecated stuff that's now been cleared in 1.7, e.g. getShareACL) and we'll need to patch the database to include the pipeline and any other schema changes. We'll have some environment contention over the next week as our performance test environment needs to remain on 1.6.1 while we test a release. Once that's clear I'll move to 1.7.
>>>
>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a simple way to migrate an existing database?
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Thanks -- I'll include that change as well then, in ticket CONNECTORS-1027.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> That comes back immediately with 10001 rows:
>>>>>
>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t;
>>>>>
>>>>>                                  QUERY PLAN
>>>>> --------------------------------------------------------------------------------
>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual time=9.125..9.125 rows=1 loops=1)
>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual time=0.033..6.945 rows=10001 loops=1)
>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue  (cost=0.00..431189.31 rows=10290271 width=0) (actual time=0.031..3.257 rows=10001 loops=1)
>>>>>                Heap Fetches: 725
>>>>>  Total runtime: 9.157 ms
>>>>> (5 rows)
>>>>>
>>>>> Whereas:
>>>>>
>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>
>>>>>                                  QUERY PLAN
>>>>> --------------------------------------------------------------------------------
>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual time=5225.107..5225.109 rows=1 loops=1)
>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue  (cost=0.00..431197.31 rows=10290271 width=0) (actual time=0.108..3090.848 rows=10370209 loops=1)
>>>>>                Heap Fetches: 684297
>>>>>  Total runtime: 5225.151 ms
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Paul
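[Editor's note: the two plans above differ because a LIMIT outside an aggregate only caps the number of *result* rows (count(*) always returns one row), so the whole table is still scanned; a LIMIT inside the FROM subquery caps the rows actually fed to count(*), which is what bounds the work. A portable sketch of the semantics using Python's bundled SQLite (same LIMIT behaviour, though SQLite's planner differs from Postgres):]

```python
import sqlite3

# 20-row stand-in for the 10M-row jobqueue table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobqueue (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO jobqueue (id) VALUES (?)",
                 [(i,) for i in range(20)])

# Outer LIMIT: applies after aggregation, so all 20 rows are counted.
outer = conn.execute(
    "SELECT count(*) FROM jobqueue LIMIT 5").fetchone()[0]

# Subquery LIMIT: at most 5 rows reach count(*).
inner = conn.execute(
    "SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 5) t").fetchone()[0]

print(outer, inner)  # 20 5
```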
>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> Could you try this query on your database please and tell me if it executes promptly:
>>>>>>
>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>
>>>>>> I vaguely remember that I had to change the form of this query in order to support MySQL -- but first let's see if this helps.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch (branches/CONNECTORS-1027) for looking at any changes we do for large-scale Postgresql optimization work.
>>>>>>>
>>>>>>> Please note that trunk code already has schema changes relative to MCF 1.7, so you will not be able to work directly with this branch code. I'll have to create patches for whatever changes you would need to try.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the doc.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> Can you include the logged plan for this query; this is an actual query encountered during crawling:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>
>>>>>>>>> These queries are all from the UI; this is what gets generated when no limits are in place:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>> WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>
>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM jobqueue t1 LIMIT 1000001]
>>>>>>>>>
>>>>>>>>> I honestly don't understand why PostgreSQL would execute a sequential scan of the entire table when given a limit clause. It certainly didn't use to do that. If you have any other suggestions please let me know.
>>>>>>>>>
>>>>>>>>> Some queries show up in this list because MCF periodically reindexes tables. For example, this query goes only against the (small) jobs table. Its poor performance on occasion is likely due to something else happening to the database, probably a reindex:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>
>>>>>>>>> The final query is the document stuffing query, which is perhaps the most critical query in the whole system:
>>>>>>>>>
>>>>>>>>> SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0
>>>>>>>>> WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime <= 1407246846166
>>>>>>>>> AND EXISTS (
>>>>>>>>>   SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5
>>>>>>>>> )
>>>>>>>>> AND NOT EXISTS (
>>>>>>>>>   SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>> )
>>>>>>>>> AND NOT EXISTS (
>>>>>>>>>   SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>> )
>>>>>>>>> ORDER BY t0.docpriority ASC
>>>>>>>>> LIMIT 480;
>>>>>>>>>
>>>>>>>>> Your analysis of whether IN beats OR does not agree with experiments I did on postgresql 8.7, which showed no difference. What Postgresql version are you using? Also, I trust you have query plans that demonstrate your claim? In any case, whether IN vs. OR is generated is a function of the MCF database driver, so this is trivial to experiment with. I'll create a ticket and a branch for experimentation.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>>
>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't seem to buy us that much on the table scan - in the attached you'll see that it's still taking a long time to return the job status page. Also in the attached are some other sample long-running queries that we're beginning to see more frequently. There's also an example of a query that's frequently executed and regularly takes > 4 secs (plus a suggested change to improve performance). This one in particular would certainly benefit from a change to SSDs, which should relieve the I/O-bound bottleneck on postgres.
>>>>>>>>>>
>>>>>>>>>> We're loading the system from 10mil towards 100mil so would be keen to work with you to optimise where possible.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>> The jobstatus query that uses count(*) should be doing something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>
>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>
>>>>>>>>>>> This will still do a sequential scan, but it will be an aborted one, so you can control the maximum amount of time spent doing the query.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from count(*)-ing but I'll certainly have another look to see if we've missed something.
>>>>>>>>>>>>
>>>>>>>>>>>> We're increasingly seeing long running threads and I'll put together some samples.
>>>>>>>>>>>> As an example, on a job that's currently aborting:
>>>>>>>>>>>>
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0: '1.000000001E9'
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1: '1407144048075'
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916 width=0)
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats: n_distinct=4.0 most_common_vals={G,C,Z,P} most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you can set which limits the number of documents counted to at most a specified amount. This uses a limit clause, which should prevent unbounded time doing these kinds of queries:
>>>>>>>>>>>>>
>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>
>>>>>>>>>>>>> The documentation says that the default value for this parameter is 10000, which however is incorrect. The actual default is 500000. You could set that lower for better UI performance (losing some information, of course.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has been spent in MCF to ensure that this doesn't happen. Specifically, the main document queuing query is structured to read directly out of a specific jobqueue index. This is the crucial query that must work properly for scalability, since doing a query that is effectively just a sort on the entire jobqueue would be a major problem. There are some times where Postgresql's optimizer fails to do the right thing here, mostly because it makes a huge distinction between whether there's zero of something or one of something, but you can work around that particular issue by setting the analyze count to 1 if you start to see this problem -- which basically means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10 million. At that point, even with good postgres vacuuming, the jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For example, select count(*) from jobqueue, which is executed when querying job status, will do a full table scan of jobqueue, which has more than 10 million rows. That's going to take some time in postgres.
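[Editor's note: the UI count cap Karl describes above is set in ManifoldCF's properties.xml. A sketch of the relevant entry; the property name is taken verbatim from Karl's email, and the surrounding configuration element and default values are as discussed in the thread:]

```xml
<configuration>
  <!-- Cap the number of documents counted for the job status page.
       Per the thread: the documented default is 10000, but the actual
       default is 500000; lower values trade detail for UI speed. -->
  <property name="org.apache.manifoldcf.ui.maxstatuscount" value="10000"/>
</configuration>
```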
>>>>>>>>>>>>>> SSDs will certainly make a big difference to document processing throughput (which we see is largely I/O bound in postgres) but we are increasingly seeing long running queries in the logs. Our current thinking is that we'll need to refactor JobQueue somewhat to optimise queries and potentially partition jobqueue into a subset of tables (table per queue, for example).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agent processes or parallel connectors. Overall database performance is the limiting factor.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that discusses this issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web documents, using Postgresql 8.3. Postgresql 9.x is faster, and with modern SSDs, I expect that you will do even better. In general, I'd say it was fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to know the maximum number of documents that you have managed to crawl with ManifoldCF, and with how many connectors in parallel it could work?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Baptiste
