It's done actually; my new laptop with PostgreSQL 9.3 can do 111111 documents in roughly 15 minutes. No problems encountered; roughly 120 documents per second.
For sanity sake, could you try the following: - check out or unpack 1.6.1 sources - lay down downloaded lib dependencies - build using "ant build" - modify properties.xml and start the approparite example Please see if you have any problems doing this process *without* any patches. Also, what kind of synchronization are you using? File based, zookeeper, or single-process? Thanks, Karl On Fri, Sep 12, 2014 at 12:29 PM, Karl Wright <[email protected]> wrote: > Hi Paul, > > The query looks right; the database driver determines the maximum number > of clauses in a conjunction OR list, just like it does for an IN() list. > In the case of Postgresql and OR, the limit is 25; for IN()'s it's 100. > > The standard integration tests generally run small jobs but that is > typically sufficient to find query generation problems. I have load tests > I can also run but they take several hours to complete. I'll start one > now, but I may need to abort it before it finishes. > > Karl > > > On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <[email protected]> > wrote: > >> Hi, >> >> I'm looking through the logs - can see the change from IN to OR in each >> query - and there's clearly a difference in execution path but it's quite >> verbose so will take a while. >> >> It may well be that document state has not been reprioritised or in some >> way inconsistent. However, I don't think it's that which is causing the >> issue - I can switch this behaviour on and off over by changing the >> DBInterfacePostgres class and restarting Manifold. That seems to suggest a >> query isn't behaving the same way between IN and OR - I just can't isolate >> the particular query (yet). >> >> Have you tested with a job already in running state (on a restart) with a >> large document count? For example am seeing this kind of thing which looks >> messy but appears to execute as you'd expect: >> >> SELECT id,dockey,lastversion,lastoutputversion,authorityname,forcedparams >> FROM ingeststatus WHERE (dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR >> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND >> connectionname=?] >> DEBUG 2014-09-12 15:01:27,052 (Thread-542) - Parameter 0: >> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A' >> DEBUG 2014-09-12 15:01:27,052 (Thread-542) - Parameter 1: >> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9' >> >> >> >> >> VP Engineering, >> Exonar Ltd >> >> T: +44 7940 567724 >> >> twitter:@exonarco @pboichat >> W: http://www.exonar.com >> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/> >> >> Exonar Limited, registered in the UK, registration number 06439969 at 14 >> West Mills, Newbury, Berkshire, RG14 5HG >> DISCLAIMER: This email and any attachments to it may be confidential and >> are intended solely for the use of the individual to whom it is addressed. >> Any views or opinions expressed are solely those of the author and do not >> necessarily represent those of Exonar Ltd. If you are not the intended >> recipient of this email, you must neither take any action based upon its >> contents, nor copy or show it to anyone. Please contact the sender if >> you believe you have received this email in error. >> >> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <[email protected]> wrote: >> >>> Hi Paul, >>> >>> The tests in fact do multiple complete crawls, so it is extremely >>> unlikely that the stuffer query is broken. If you look at the queries >>> generated, you should note that the only difference is that whenever an xxx >>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead. >>> These should be completely equivalent; if they don't look equivalent to you >>> in the log, then I will fix whatever is broken. I'll make sure here that >>> the queries look right visually too. >>> >>> One possibility is that when you restarted the agents process, the >>> jobqueue records did not yet finish getting reprioritized. Stuffer queries >>> are fired all the time, but the running jobs must complete reprioritization >>> before the stuffer query will pick up any records. I wonder if they may >>> not have managed to get to the right state before you aborted the >>> experiment? You can tell what is happening by using jstack to get a thread >>> dump of the agents process. >>> >>> Thanks, >>> Karl >>> >>> >>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <[email protected]> >>> wrote: >>> >>>> I stayed with base 1.6.1 and manually patched the code to include the >>>> two new methods in DBInterfacePostgreSQL >>>> >>>> Paul >>>> >>>> >>>> >>>> VP Engineering, >>>> Exonar Ltd >>>> >>>> T: +44 7940 567724 >>>> >>>> twitter:@exonarco @pboichat >>>> W: http://www.exonar.com >>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/> >>>> >>>> Exonar Limited, registered in the UK, registration number 06439969 at 14 >>>> West Mills, Newbury, Berkshire, RG14 5HG >>>> DISCLAIMER: This email and any attachments to it may be confidential >>>> and are intended solely for the use of the individual to whom it is >>>> addressed. Any views or opinions expressed are solely those of the author >>>> and do not necessarily represent those of Exonar Ltd. If you are not >>>> the intended recipient of this email, you must neither take any action >>>> based upon its contents, nor copy or show it to anyone. Please contact >>>> the sender if you believe you have received this email in error. >>>> >>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <[email protected]> >>>> wrote: >>>> >>>>> The changes pass all tests here. Is it possible that you attempted >>>>> some upgrade that failed (or didn't attempt upgrade but went to a new code >>>>> version)? >>>>> >>>>> If you could let me know as exactly as possible what you did, I can >>>>> let you know if that should have worked or not. >>>>> >>>>> Thanks! >>>>> Karl >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat < >>>>> [email protected]> wrote: >>>>> >>>>>> Karl, >>>>>> >>>>>> We appear to be seeing an issue with the performance change to use an >>>>>> OR clause rather than IN. After making the change, when we restart >>>>>> manifoldcf (with one job in running state) documents in the running job >>>>>> are >>>>>> not picked up for processing by the stuffer thread. If we redploy base >>>>>> 1.6.1 and restart documents are processed. This is consistently >>>>>> switchable >>>>>> depending on which code base is deployed. >>>>>> >>>>>> We have logs that I could upload to the ticket if you recommend that >>>>>> we reopen the issue (or create a new one)? >>>>>> >>>>>> Paul >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> VP Engineering, >>>>>> Exonar Ltd >>>>>> >>>>>> T: +44 7940 567724 >>>>>> >>>>>> twitter:@exonarco @pboichat >>>>>> W: http://www.exonar.com >>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/> >>>>>> >>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14 >>>>>> West Mills, Newbury, Berkshire, RG14 5HG >>>>>> DISCLAIMER: This email and any attachments to it may be confidential >>>>>> and are intended solely for the use of the individual to whom it is >>>>>> addressed. Any views or opinions expressed are solely those of the author >>>>>> and do not necessarily represent those of Exonar Ltd. If you are not >>>>>> the intended recipient of this email, you must neither take any action >>>>>> based upon its contents, nor copy or show it to anyone. Please >>>>>> contact the sender if you believe you have received this email in error. >>>>>> >>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi Paul -- >>>>>>> >>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of >>>>>>> trunk, which is MCF 2.0. MCF 2.0 is not backwards compatible with any >>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x >>>>>>> release >>>>>>> to 2.0. That's why I said to use the patches, and try to stay on 1.6.1 >>>>>>> or >>>>>>> at most to migrate to 1.7. >>>>>>> >>>>>>> IF you ALREADY tried an upgrade with the branch code, then you would >>>>>>> have wound up in a schema state where the schema had more columns in it >>>>>>> than the branch knew how to deal with. That's bad, and you will need >>>>>>> to do >>>>>>> things to fix the situation. I believe you should still be able to do >>>>>>> the >>>>>>> following: >>>>>>> >>>>>>> - Download 1.7 source, or check out >>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch >>>>>>> - Apply the patches >>>>>>> - Build >>>>>>> - Modify your properties.xml to point to your postgresql instance >>>>>>> - Run the upgrade (initialize.bat on the multi-process example, or >>>>>>> start the single-process example) >>>>>>> >>>>>>> You should then have a working 1.7 release, with code patches >>>>>>> applied. >>>>>>> >>>>>>> Thanks, >>>>>>> Karl >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Thanks - we've pulled down the branch and will test the changes. It >>>>>>>> looks like a branch of 1.7 so it's going to take us a little while to >>>>>>>> test. >>>>>>>> We need to migrate our connectors (there's some deprecated stuff >>>>>>>> that's now >>>>>>>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the >>>>>>>> database >>>>>>>> to include the pipeline and any other schema changes. We'll have some >>>>>>>> environment contention over the next week as our performance test >>>>>>>> environment needs to remain on 1.6.1 while we test a release. Once >>>>>>>> that's >>>>>>>> clear I'll move to 1.7 >>>>>>>> >>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a >>>>>>>> simple way to migrate and existing database? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Paul >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> VP Engineering, >>>>>>>> Exonar Ltd >>>>>>>> >>>>>>>> T: +44 7940 567724 >>>>>>>> >>>>>>>> twitter:@exonarco @pboichat >>>>>>>> W: http://www.exonar.com >>>>>>>> Nothing is secure. Now what? Exonar Raven >>>>>>>> <http://video.exonar.com/> >>>>>>>> >>>>>>>> Exonar Limited, registered in the UK, registration number 06439969 >>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG >>>>>>>> DISCLAIMER: This email and any attachments to it may be >>>>>>>> confidential and are intended solely for the use of the individual to >>>>>>>> whom >>>>>>>> it is addressed. Any views or opinions expressed are solely those of >>>>>>>> the >>>>>>>> author and do not necessarily represent those of Exonar Ltd. If >>>>>>>> you are not the intended recipient of this email, you must neither >>>>>>>> take any >>>>>>>> action based upon its contents, nor copy or show it to anyone. Please >>>>>>>> contact the sender if you believe you have received this email in >>>>>>>> error. >>>>>>>> >>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Thanks -- I'll include that change as well then, in ticket >>>>>>>>> CONNECTORS-1027. >>>>>>>>> >>>>>>>>> >>>>>>>>> Karl >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> That comes back immediately with 10001 rows: >>>>>>>>>> >>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue >>>>>>>>>> LIMIT 10001) t; >>>>>>>>>> >>>>>>>>>> QUERY PLAN >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> ----------------------------------------------------------------------------------------------------------------------- >>>>>>>>>> ---------------------------------- >>>>>>>>>> Aggregate (cost=544.08..544.09 rows=1 width=0) (actual >>>>>>>>>> time=9.125..9.125 rows=1 loops=1) >>>>>>>>>> -> Limit (cost=0.00..419.07 rows=10001 width=0) (actual >>>>>>>>>> time=0.033..6.945 rows=10001 loops=1) >>>>>>>>>> -> Index Only Scan using jobqueue_pkey on jobqueue >>>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time >>>>>>>>>> =0.031..3.257 rows=10001 loops=1) >>>>>>>>>> Heap Fetches: 725 >>>>>>>>>> Total runtime: 9.157 ms >>>>>>>>>> (5 rows) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Whereas: >>>>>>>>>> >>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001; >>>>>>>>>> >>>>>>>>>> QUERY PLAN >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> ----------------------------------------------------------------------------------------------------------------------- >>>>>>>>>> ---------------------------------------- >>>>>>>>>> Limit (cost=456922.99..456923.00 rows=1 width=0) (actual >>>>>>>>>> time=5225.107..5225.109 rows=1 loops=1) >>>>>>>>>> -> Aggregate (cost=456922.99..456923.00 rows=1 width=0) >>>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1) >>>>>>>>>> -> Index Only Scan using jobqueue_pkey on jobqueue >>>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time >>>>>>>>>> =0.108..3090.848 rows=10370209 loops=1) >>>>>>>>>> Heap Fetches: 684297 >>>>>>>>>> Total runtime: 5225.151 ms >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Paul >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> VP Engineering, >>>>>>>>>> Exonar Ltd >>>>>>>>>> >>>>>>>>>> T: +44 7940 567724 >>>>>>>>>> >>>>>>>>>> twitter:@exonarco @pboichat >>>>>>>>>> W: http://www.exonar.com >>>>>>>>>> Nothing is secure. Now what? Exonar Raven >>>>>>>>>> <http://video.exonar.com/> >>>>>>>>>> >>>>>>>>>> Exonar Limited, registered in the UK, registration number >>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG >>>>>>>>>> DISCLAIMER: This email and any attachments to it may be >>>>>>>>>> confidential and are intended solely for the use of the individual >>>>>>>>>> to whom >>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of >>>>>>>>>> the >>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If >>>>>>>>>> you are not the intended recipient of this email, you must neither >>>>>>>>>> take any >>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please >>>>>>>>>> contact the sender if you believe you have received this email in >>>>>>>>>> error. >>>>>>>>>> >>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <[email protected] >>>>>>>>>> > wrote: >>>>>>>>>> >>>>>>>>>>> Hi Paul, >>>>>>>>>>> >>>>>>>>>>> Could you try this query on your database please and tell me if >>>>>>>>>>> it executes promptly: >>>>>>>>>>> >>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I vaguely remember that I had to change the form of this query >>>>>>>>>>> in order to support MySQL -- but first let's see if this helps. >>>>>>>>>>> >>>>>>>>>>> Karl >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <[email protected] >>>>>>>>>>> > wrote: >>>>>>>>>>> >>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based >>>>>>>>>>>> branch (branches/CONNECTORS-1027) for looking at any changes we do >>>>>>>>>>>> for >>>>>>>>>>>> large-scale Postgresql optimization work. >>>>>>>>>>>> >>>>>>>>>>>> Please note that trunk code already has schema changes relative >>>>>>>>>>>> to MCF 1.7, so you will not be able to work directly with this >>>>>>>>>>>> branch >>>>>>>>>>>> code. I'll have to create patches for whatever changes you would >>>>>>>>>>>> need to >>>>>>>>>>>> try. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Karl >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them >>>>>>>>>>>>> to the doc. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> >>>>>>>>>>>>> Paul >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> VP Engineering, >>>>>>>>>>>>> Exonar Ltd >>>>>>>>>>>>> >>>>>>>>>>>>> T: +44 7940 567724 >>>>>>>>>>>>> >>>>>>>>>>>>> twitter:@exonarco @pboichat >>>>>>>>>>>>> W: http://www.exonar.com >>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven >>>>>>>>>>>>> <http://video.exonar.com/> >>>>>>>>>>>>> >>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number >>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG >>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be >>>>>>>>>>>>> confidential and are intended solely for the use of the >>>>>>>>>>>>> individual to whom >>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those >>>>>>>>>>>>> of the >>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If >>>>>>>>>>>>> you are not the intended recipient of this email, you must >>>>>>>>>>>>> neither take any >>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. >>>>>>>>>>>>> Please >>>>>>>>>>>>> contact the sender if you believe you have received this email in >>>>>>>>>>>>> error. >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Paul, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Can you include the logged plan for this query; this is an >>>>>>>>>>>>>> actual query encountered during crawling: >>>>>>>>>>>>>> >>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a >>>>>>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid >>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND >>>>>>>>>>>>>> t1.parentidhash=? AND >>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?] >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> These queries are all from the UI; it is what gets generated >>>>>>>>>>>>>> when no limits are in place: >>>>>>>>>>>>>> >>>>>>>>>>>>>> WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found >>>>>>>>>>>>>> a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) >>>>>>>>>>>>>> AS doccount >>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid] >>>>>>>>>>>>>> WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found >>>>>>>>>>>>>> a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) >>>>>>>>>>>>>> AS doccount >>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid] >>>>>>>>>>>>>> >>>>>>>>>>>>>> This query is from the UI with a limit of 1000000: >>>>>>>>>>>>>> >>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found >>>>>>>>>>>>>> a long-running query (254851 ms): [SELECT COUNT(dochash) AS >>>>>>>>>>>>>> doccount FROM >>>>>>>>>>>>>> jobqueue t1 LIMIT 1000001] >>>>>>>>>>>>>> >>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a >>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause. >>>>>>>>>>>>>> It >>>>>>>>>>>>>> certainly didn't used to do that. If you have any other >>>>>>>>>>>>>> suggestions please >>>>>>>>>>>>>> let me know. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Some queries show up in this list because MCF periodically >>>>>>>>>>>>>> reindexes tables. For example, this query goes only against the >>>>>>>>>>>>>> (small) >>>>>>>>>>>>>> jobs table. Its poor performance on occasion is likely due to >>>>>>>>>>>>>> something >>>>>>>>>>>>>> else happening to the database, probably a reindex: >>>>>>>>>>>>>> >>>>>>>>>>>>>> WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a >>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE >>>>>>>>>>>>>> status IN >>>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE] >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> The final query is the document stuffing query, which is >>>>>>>>>>>>>> perhaps the most critical query in the whole system: >>>>>>>>>>>>>> >>>>>>>>>>>>>> SELECT >>>>>>>>>>>>>> t0.id >>>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount, >>>>>>>>>>>>>> t0.priorityset FROM jobqueue t0 >>>>>>>>>>>>>> WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND >>>>>>>>>>>>>> t0.checktime >>>>>>>>>>>>>> <= 1407246846166 >>>>>>>>>>>>>> AND EXISTS ( >>>>>>>>>>>>>> SELECT 'x' FROM jobs t1 >>>>>>>>>>>>>> WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND >>>>>>>>>>>>>> t1.priority=5 >>>>>>>>>>>>>> ) >>>>>>>>>>>>>> AND NOT EXISTS ( >>>>>>>>>>>>>> SELECT 'x' FROM jobqueue t2 >>>>>>>>>>>>>> WHERE t2.dochash=t0.dochash AND t2.status IN >>>>>>>>>>>>>> ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid >>>>>>>>>>>>>> ) >>>>>>>>>>>>>> AND NOT EXISTS ( >>>>>>>>>>>>>> SELECT 'x' FROM prereqevents t3,events t4 >>>>>>>>>>>>>> WHERE t0.id=t3.owner AND t3.eventname=t4.name >>>>>>>>>>>>>> ) >>>>>>>>>>>>>> ORDER BY t0.docpriority ASC >>>>>>>>>>>>>> LIMIT 480; >>>>>>>>>>>>>> >>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with >>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference. >>>>>>>>>>>>>> What >>>>>>>>>>>>>> Postgresql version are you using? Also, I trust you have query >>>>>>>>>>>>>> plans that >>>>>>>>>>>>>> demonstrate your claim? In any case, whether IN vs. OR is >>>>>>>>>>>>>> generated is a >>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to >>>>>>>>>>>>>> experiment >>>>>>>>>>>>>> with. I'll create a ticket and a branch for experimentation. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Karl >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Karl, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000) >>>>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the >>>>>>>>>>>>>>> attached you'll >>>>>>>>>>>>>>> see that it's still taking a long time to return the job status >>>>>>>>>>>>>>> page. Also >>>>>>>>>>>>>>> in the attached are some sample other long running queries that >>>>>>>>>>>>>>> we're >>>>>>>>>>>>>>> beginning to see more frequently. There's also an example of a >>>>>>>>>>>>>>> query that's >>>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a >>>>>>>>>>>>>>> suggested change >>>>>>>>>>>>>>> to improve performance). This one in particular would certainly >>>>>>>>>>>>>>> benefit >>>>>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound >>>>>>>>>>>>>>> bottleneck on >>>>>>>>>>>>>>> postgres. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so would >>>>>>>>>>>>>>> be keen to work with you to optimise where possible. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Paul >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> VP Engineering, >>>>>>>>>>>>>>> Exonar Ltd >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> T: +44 7940 567724 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> twitter:@exonarco @pboichat >>>>>>>>>>>>>>> W: http://www.exonar.com >>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven >>>>>>>>>>>>>>> <http://video.exonar.com/> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number >>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG >>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be >>>>>>>>>>>>>>> confidential and are intended solely for the use of the >>>>>>>>>>>>>>> individual to whom >>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely >>>>>>>>>>>>>>> those of the >>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If >>>>>>>>>>>>>>> you are not the intended recipient of this email, you must >>>>>>>>>>>>>>> neither take any >>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. >>>>>>>>>>>>>>> Please >>>>>>>>>>>>>>> contact the sender if you believe you have received this email >>>>>>>>>>>>>>> in error. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Paul, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing >>>>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is >>>>>>>>>>>>>>>> set: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an >>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time >>>>>>>>>>>>>>>> spent doing the >>>>>>>>>>>>>>>> query. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it >>>>>>>>>>>>>>>>> from count(*)-ing but I'll certainly have another look to see >>>>>>>>>>>>>>>>> if we've >>>>>>>>>>>>>>>>> missed something. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll >>>>>>>>>>>>>>>>> put together some samples. As an example, on a job that's >>>>>>>>>>>>>>>>> currently >>>>>>>>>>>>>>>>> aborting: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a >>>>>>>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET >>>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?] >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - >>>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9' >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - >>>>>>>>>>>>>>>>> Parameter 1: '1407144048075' >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: >>>>>>>>>>>>>>>>> Update on jobqueue (cost=18806.08..445770.39 rows=764916 >>>>>>>>>>>>>>>>> width=287) >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>>>>>>>>>>> Plan: -> Bitmap Heap Scan on jobqueue >>>>>>>>>>>>>>>>> (cost=18806.08..445770.39 >>>>>>>>>>>>>>>>> rows=764916 width=287) >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>>>>>>>>>>> Plan: Recheck Cond: (jobid = 1407144048075::bigint) >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>>>>>>>>>>> Plan: -> Bitmap Index Scan on i1392985450177 >>>>>>>>>>>>>>>>> (cost=0.00..18614.85 >>>>>>>>>>>>>>>>> rows=764916 width=0) >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>>>>>>>>>>> Plan: Index Cond: (jobid = >>>>>>>>>>>>>>>>> 1407144048075::bigint) >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) - Stats: >>>>>>>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P} >>>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665} >>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) - >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Paul >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> VP Engineering, >>>>>>>>>>>>>>>>> Exonar Ltd >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> T: +44 7940 567724 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat >>>>>>>>>>>>>>>>> W: http://www.exonar.com >>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven >>>>>>>>>>>>>>>>> <http://video.exonar.com/> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number >>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG >>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be >>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the >>>>>>>>>>>>>>>>> individual to whom >>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely >>>>>>>>>>>>>>>>> those of the >>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. >>>>>>>>>>>>>>>>> If >>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must >>>>>>>>>>>>>>>>> neither take any >>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to >>>>>>>>>>>>>>>>> anyone. Please >>>>>>>>>>>>>>>>> contact the sender if you believe you have received this >>>>>>>>>>>>>>>>> email in error. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Paul, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter >>>>>>>>>>>>>>>>>> you can set which limits the number of documents counted to >>>>>>>>>>>>>>>>>> at most a >>>>>>>>>>>>>>>>>> specified amount. This uses a limit clause, which should >>>>>>>>>>>>>>>>>> prevent unbounded >>>>>>>>>>>>>>>>>> time doing these kinds of queries: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The documentation says that the default value for this >>>>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect. The actual >>>>>>>>>>>>>>>>>> true default is >>>>>>>>>>>>>>>>>> 500000. You could set that lower for better UI performance >>>>>>>>>>>>>>>>>> (losing some >>>>>>>>>>>>>>>>>> information, of course.) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has >>>>>>>>>>>>>>>>>> been spent in MCF to insure that this doesn't happen. >>>>>>>>>>>>>>>>>> Specifically, the >>>>>>>>>>>>>>>>>> main document queuing query is structured to read directly >>>>>>>>>>>>>>>>>> out of a >>>>>>>>>>>>>>>>>> specific jobqueue index. This is the crucial query that >>>>>>>>>>>>>>>>>> must work properly >>>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively >>>>>>>>>>>>>>>>>> just a sort on the >>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem. There are some >>>>>>>>>>>>>>>>>> times where >>>>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, >>>>>>>>>>>>>>>>>> mostly because it >>>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of >>>>>>>>>>>>>>>>>> something or one >>>>>>>>>>>>>>>>>> of something, but you can work around that particular issue >>>>>>>>>>>>>>>>>> by setting the >>>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which >>>>>>>>>>>>>>>>>> basically >>>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every >>>>>>>>>>>>>>>>>> stuffing query. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running >>>>>>>>>>>>>>>>>> in your case so that I can see if that is what you are >>>>>>>>>>>>>>>>>> encountering or not. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi Karl, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10 >>>>>>>>>>>>>>>>>>> million. At that point, even with good postgres >>>>>>>>>>>>>>>>>>> vacuuming the jobqueue table is starting to become a >>>>>>>>>>>>>>>>>>> bottleneck. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is >>>>>>>>>>>>>>>>>>> executed when querying job status will do a full table scan >>>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's >>>>>>>>>>>>>>>>>>> going to take some time in postgres. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document >>>>>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in >>>>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running >>>>>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll >>>>>>>>>>>>>>>>>>> need to refactor >>>>>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially >>>>>>>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per >>>>>>>>>>>>>>>>>>> queue for example). >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Paul >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> VP Engineering, >>>>>>>>>>>>>>>>>>> Exonar Ltd >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> T: +44 7940 567724 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat >>>>>>>>>>>>>>>>>>> W: http://www.exonar.com >>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven >>>>>>>>>>>>>>>>>>> <http://video.exonar.com/> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration >>>>>>>>>>>>>>>>>>> number 06439969 at 14 West Mills, Newbury, Berkshire, >>>>>>>>>>>>>>>>>>> RG14 5HG >>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be >>>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the >>>>>>>>>>>>>>>>>>> individual to whom >>>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely >>>>>>>>>>>>>>>>>>> those of the >>>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar >>>>>>>>>>>>>>>>>>> Ltd. If >>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must >>>>>>>>>>>>>>>>>>> neither take any >>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to >>>>>>>>>>>>>>>>>>> anyone. Please >>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this >>>>>>>>>>>>>>>>>>> email in error. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi Baptiste, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents >>>>>>>>>>>>>>>>>>>> processes or parallel connectors. Overall database >>>>>>>>>>>>>>>>>>>> performance is the >>>>>>>>>>>>>>>>>>>> limiting factor. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I would read this: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe >>>>>>>>>>>>>>>>>>>> Chapter 2) that discusses this issue. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million >>>>>>>>>>>>>>>>>>>> web documents, using Postgresql 8.3. Postgresql 9.x is >>>>>>>>>>>>>>>>>>>> faster, and with >>>>>>>>>>>>>>>>>>>> modern SSD's, I expect that you will do even better. In >>>>>>>>>>>>>>>>>>>> general, I'd say >>>>>>>>>>>>>>>>>>>> it was fine to shoot for 10M - 100M documents on >>>>>>>>>>>>>>>>>>>> ManifoldCF, provided that >>>>>>>>>>>>>>>>>>>> you use a good database, and provided that you maintain it >>>>>>>>>>>>>>>>>>>> properly. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of >>>>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and >>>>>>>>>>>>>>>>>>>>> with how many >>>>>>>>>>>>>>>>>>>>> connectors in parallel it could works ? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks for your answer >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Baptiste >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
