Thanks -- I'll include that change as well then, in ticket CONNECTORS-1027.
Karl On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <[email protected]> wrote: > Hi, > > That comes back immediately with 10001 rows: > > explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT > 10001) t; > > QUERY PLAN > > > ----------------------------------------------------------------------------------------------------------------------- > ---------------------------------- > Aggregate (cost=544.08..544.09 rows=1 width=0) (actual time=9.125..9.125 > rows=1 loops=1) > -> Limit (cost=0.00..419.07 rows=10001 width=0) (actual > time=0.033..6.945 rows=10001 loops=1) > -> Index Only Scan using jobqueue_pkey on jobqueue > (cost=0.00..431189.31 rows=10290271 width=0) (actual time > =0.031..3.257 rows=10001 loops=1) > Heap Fetches: 725 > Total runtime: 9.157 ms > (5 rows) > > > Whereas: > > explain analyze SELECT count(*) FROM jobqueue limit 10001; > > QUERY PLAN > > > ----------------------------------------------------------------------------------------------------------------------- > ---------------------------------------- > Limit (cost=456922.99..456923.00 rows=1 width=0) (actual > time=5225.107..5225.109 rows=1 loops=1) > -> Aggregate (cost=456922.99..456923.00 rows=1 width=0) (actual > time=5225.105..5225.106 rows=1 loops=1) > -> Index Only Scan using jobqueue_pkey on jobqueue > (cost=0.00..431197.31 rows=10290271 width=0) (actual time > =0.108..3090.848 rows=10370209 loops=1) > Heap Fetches: 684297 > Total runtime: 5225.151 ms > > Thanks, > > Paul > > > > VP Engineering, > Exonar Ltd > > T: +44 7940 567724 > > twitter:@exonarco @pboichat > W: http://www.exonar.com > Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/> > > Exonar Limited, registered in the UK, registration number 06439969 at 14 > West Mills, Newbury, Berkshire, RG14 5HG > DISCLAIMER: This email and any attachments to it may be confidential and > are intended solely for the use of the individual to whom it is addressed. > Any views or opinions expressed are solely those of the author and do not > necessarily represent those of Exonar Ltd. If you are not the intended > recipient of this email, you must neither take any action based upon its > contents, nor copy or show it to anyone. Please contact the sender if you > believe you have received this email in error. > > On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <[email protected]> wrote: > >> Hi Paul, >> >> Could you try this query on your database please and tell me if it >> executes promptly: >> >> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t >> >> >> I vaguely remember that I had to change the form of this query in order >> to support MySQL -- but first let's see if this helps. >> >> Karl >> >> >> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <[email protected]> wrote: >> >>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch >>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale >>> Postgresql optimization work. >>> >>> Please note that trunk code already has schema changes relative to MCF >>> 1.7, so you will not be able to work directly with this branch code. I'll >>> have to create patches for whatever changes you would need to try. >>> >>> Thanks, >>> Karl >>> >>> >>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <[email protected]> >>> wrote: >>> >>>> Hi, >>>> >>>> We're on Postgres 9.2. I'll get the query plans and add them to the doc. >>>> >>>> Thanks >>>> >>>> Paul >>>> >>>> >>>> >>>> VP Engineering, >>>> Exonar Ltd >>>> >>>> T: +44 7940 567724 >>>> >>>> twitter:@exonarco @pboichat >>>> W: http://www.exonar.com >>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/> >>>> >>>> Exonar Limited, registered in the UK, registration number 06439969 at 14 >>>> West Mills, Newbury, Berkshire, RG14 5HG >>>> DISCLAIMER: This email and any attachments to it may be confidential >>>> and are intended solely for the use of the individual to whom it is >>>> addressed. Any views or opinions expressed are solely those of the author >>>> and do not necessarily represent those of Exonar Ltd. If you are not >>>> the intended recipient of this email, you must neither take any action >>>> based upon its contents, nor copy or show it to anyone. Please contact >>>> the sender if you believe you have received this email in error. >>>> >>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <[email protected]> >>>> wrote: >>>> >>>>> Hi Paul, >>>>> >>>>> Can you include the logged plan for this query; this is an actual >>>>> query encountered during crawling: >>>>> >>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a >>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid >>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND >>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?] >>>>> >>>>> >>>>> These queries are all from the UI; it is what gets generated when no >>>>> limits are in place: >>>>> >>>>> WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a >>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount >>>>> FROM jobqueue t1 GROUP BY jobid] >>>>> WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a >>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount >>>>> FROM jobqueue t1 GROUP BY jobid] >>>>> >>>>> This query is from the UI with a limit of 1000000: >>>>> >>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a >>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM >>>>> jobqueue t1 LIMIT 1000001] >>>>> >>>>> I honestly don't understand why PostgreSQL would execute a sequential >>>>> scan of the entire table when given a limit clause. It certainly didn't >>>>> used to do that. If you have any other suggestions please let me know. >>>>> >>>>> >>>>> >>>>> Some queries show up in this list because MCF periodically reindexes >>>>> tables. For example, this query goes only against the (small) jobs table. >>>>> Its poor performance on occasion is likely due to something else happening >>>>> to the database, probably a reindex: >>>>> >>>>> WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running >>>>> query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR >>>>> UPDATE] >>>>> >>>>> >>>>> The final query is the document stuffing query, which is perhaps the >>>>> most critical query in the whole system: >>>>> >>>>> SELECT >>>>> t0.id >>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount, >>>>> t0.priorityset FROM jobqueue t0 >>>>> WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime >>>>> <= 1407246846166 >>>>> AND EXISTS ( >>>>> SELECT 'x' FROM jobs t1 >>>>> WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND >>>>> t1.priority=5 >>>>> ) >>>>> AND NOT EXISTS ( >>>>> SELECT 'x' FROM jobqueue t2 >>>>> WHERE t2.dochash=t0.dochash AND t2.status IN >>>>> ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid >>>>> ) >>>>> AND NOT EXISTS ( >>>>> SELECT 'x' FROM prereqevents t3,events t4 >>>>> WHERE t0.id=t3.owner AND t3.eventname=t4.name >>>>> ) >>>>> ORDER BY t0.docpriority ASC >>>>> LIMIT 480; >>>>> >>>>> Your analysis of whether IN beats OR does not agree with experiments I >>>>> did on postgresql 8.7 which showed no difference. What Postgresql version >>>>> are you using? Also, I trust you have query plans that demonstrate your >>>>> claim? In any case, whether IN vs. OR is generated is a function of the >>>>> MCF database driver, so this is trivial to experiment with. I'll create a >>>>> ticket and a branch for experimentation. >>>>> >>>>> Thanks, >>>>> Karl >>>>> >>>>> >>>>> >>>>> >>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <[email protected] >>>>> > wrote: >>>>> >>>>>> Hi Karl, >>>>>> >>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't >>>>>> seem to buy us that much on the table scan - in the attached you'll see >>>>>> that it's still taking a long time to return the job status page. Also in >>>>>> the attached are some sample other long running queries that we're >>>>>> beginning to see more frequently. There's also an example of a query >>>>>> that's >>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change >>>>>> to improve performance). This one in particular would certainly benefit >>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on >>>>>> postgres. >>>>>> >>>>>> We're loading the system from 10mil towards 100mil so would be keen >>>>>> to work with you to optimise where possible. >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Paul >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> VP Engineering, >>>>>> Exonar Ltd >>>>>> >>>>>> T: +44 7940 567724 >>>>>> >>>>>> twitter:@exonarco @pboichat >>>>>> W: http://www.exonar.com >>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/> >>>>>> >>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14 >>>>>> West Mills, Newbury, Berkshire, RG14 5HG >>>>>> DISCLAIMER: This email and any attachments to it may be confidential >>>>>> and are intended solely for the use of the individual to whom it is >>>>>> addressed. Any views or opinions expressed are solely those of the author >>>>>> and do not necessarily represent those of Exonar Ltd. If you are not >>>>>> the intended recipient of this email, you must neither take any action >>>>>> based upon its contents, nor copy or show it to anyone. Please >>>>>> contact the sender if you believe you have received this email in error. >>>>>> >>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi Paul, >>>>>>> >>>>>>> The jobstatus query that uses count(*) should be doing something >>>>>>> like this when the maxdocumentstatuscount value is set: >>>>>>> >>>>>>> select count(*) from jobqueue where xxx limit 500001 >>>>>>> >>>>>>> This will still do a sequential scan, but it will be an aborted one, >>>>>>> so you can control the maximum amount of time spent doing the query. >>>>>>> >>>>>>> Karl >>>>>>> >>>>>>> >>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> We've had a play with maxstatuscount and couldn't stop it from >>>>>>>> count(*)-ing but I'll certainly have another look to see if we've >>>>>>>> missed >>>>>>>> something. >>>>>>>> >>>>>>>> We're increasingly seeing long running threads and I'll put >>>>>>>> together some samples. As an example, on a job that's currently >>>>>>>> aborting: >>>>>>>> >>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a >>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET >>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?] >>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Parameter 0: >>>>>>>> '1.000000001E9' >>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Parameter 1: >>>>>>>> '1407144048075' >>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: Update on >>>>>>>> jobqueue (cost=18806.08..445770.39 rows=764916 width=287) >>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: -> >>>>>>>> Bitmap Heap Scan on jobqueue (cost=18806.08..445770.39 rows=764916 >>>>>>>> width=287) >>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: >>>>>>>> Recheck Cond: (jobid = 1407144048075::bigint) >>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: >>>>>>>> -> Bitmap Index Scan on i1392985450177 (cost=0.00..18614.85 >>>>>>>> rows=764916 >>>>>>>> width=0) >>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>> Plan: Index Cond: (jobid = 1407144048075::bigint) >>>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - >>>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) - Stats: >>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P} >>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665} >>>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) - >>>>>>>> >>>>>>>> Paul >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> VP Engineering, >>>>>>>> Exonar Ltd >>>>>>>> >>>>>>>> T: +44 7940 567724 >>>>>>>> >>>>>>>> twitter:@exonarco @pboichat >>>>>>>> W: http://www.exonar.com >>>>>>>> Nothing is secure. Now what? Exonar Raven >>>>>>>> <http://video.exonar.com/> >>>>>>>> >>>>>>>> Exonar Limited, registered in the UK, registration number 06439969 >>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG >>>>>>>> DISCLAIMER: This email and any attachments to it may be >>>>>>>> confidential and are intended solely for the use of the individual to >>>>>>>> whom >>>>>>>> it is addressed. Any views or opinions expressed are solely those of >>>>>>>> the >>>>>>>> author and do not necessarily represent those of Exonar Ltd. If >>>>>>>> you are not the intended recipient of this email, you must neither >>>>>>>> take any >>>>>>>> action based upon its contents, nor copy or show it to anyone. Please >>>>>>>> contact the sender if you believe you have received this email in >>>>>>>> error. >>>>>>>> >>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Paul, >>>>>>>>> >>>>>>>>> For the jobqueue scans from the UI, there is a parameter you can >>>>>>>>> set which limits the number of documents counted to at most a >>>>>>>>> specified >>>>>>>>> amount. This uses a limit clause, which should prevent unbounded time >>>>>>>>> doing these kinds of queries: >>>>>>>>> >>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount >>>>>>>>> >>>>>>>>> The documentation says that the default value for this parameter >>>>>>>>> is 10000, which however is incorrect. The actual true default is >>>>>>>>> 500000. >>>>>>>>> You could set that lower for better UI performance (losing some >>>>>>>>> information, of course.) >>>>>>>>> >>>>>>>>> As for long-running queries, a lot of time and effort has been >>>>>>>>> spent in MCF to insure that this doesn't happen. Specifically, the >>>>>>>>> main >>>>>>>>> document queuing query is structured to read directly out of a >>>>>>>>> specific >>>>>>>>> jobqueue index. This is the crucial query that must work properly for >>>>>>>>> scalability, since doing a query that is effectively just a sort on >>>>>>>>> the >>>>>>>>> entire jobqueue would be a major problem. There are some times where >>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly >>>>>>>>> because it >>>>>>>>> makes a huge distinction between whether there's zero of something or >>>>>>>>> one >>>>>>>>> of something, but you can work around that particular issue by >>>>>>>>> setting the >>>>>>>>> analyze count to 1 if you start to see this problem -- which basically >>>>>>>>> means that reanalysis of the table has to occur on every stuffing >>>>>>>>> query. >>>>>>>>> >>>>>>>>> I'd appreciate seeing the queries that are long-running in your >>>>>>>>> case so that I can see if that is what you are encountering or not. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Karl >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> Hi Karl, >>>>>>>>>> >>>>>>>>>> We're beginning to see issues with a document count > 10 million. >>>>>>>>>> At that point, even with good postgres vacuuming the jobqueue >>>>>>>>>> table is starting to become a bottleneck. >>>>>>>>>> >>>>>>>>>> For example select count(*) from jobqueue, which is executed >>>>>>>>>> when querying job status will do a full table scan of jobqueue >>>>>>>>>> which has more than 10 million rows. That's going to take some time >>>>>>>>>> in >>>>>>>>>> postgres. >>>>>>>>>> >>>>>>>>>> SSDs will certainly make a big difference to document processing >>>>>>>>>> through-put (which we see is largely I/O bound in postgres) but >>>>>>>>>> we are increasingly seeing long running queries in the logs. Our >>>>>>>>>> current >>>>>>>>>> thinking is that we'll need to refactor JobQueue somewhat to >>>>>>>>>> optimise queries and, potentially partition jobqueue into a >>>>>>>>>> subset of tables (table per queue for example). >>>>>>>>>> >>>>>>>>>> Paul >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> VP Engineering, >>>>>>>>>> Exonar Ltd >>>>>>>>>> >>>>>>>>>> T: +44 7940 567724 >>>>>>>>>> >>>>>>>>>> twitter:@exonarco @pboichat >>>>>>>>>> W: http://www.exonar.com >>>>>>>>>> Nothing is secure. Now what? Exonar Raven >>>>>>>>>> <http://video.exonar.com/> >>>>>>>>>> >>>>>>>>>> Exonar Limited, registered in the UK, registration number >>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG >>>>>>>>>> DISCLAIMER: This email and any attachments to it may be >>>>>>>>>> confidential and are intended solely for the use of the individual >>>>>>>>>> to whom >>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of >>>>>>>>>> the >>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If >>>>>>>>>> you are not the intended recipient of this email, you must neither >>>>>>>>>> take any >>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please >>>>>>>>>> contact the sender if you believe you have received this email in >>>>>>>>>> error. >>>>>>>>>> >>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Baptiste, >>>>>>>>>>> >>>>>>>>>>> ManifoldCF is not limited by the number of agents processes or >>>>>>>>>>> parallel connectors. Overall database performance is the limiting >>>>>>>>>>> factor. >>>>>>>>>>> >>>>>>>>>>> I would read this: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html >>>>>>>>>>> >>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that >>>>>>>>>>> discusses this issue. >>>>>>>>>>> >>>>>>>>>>> Some five years ago, I successfully crawled 5 million web >>>>>>>>>>> documents, using Postgresql 8.3. Postgresql 9.x is faster, and >>>>>>>>>>> with modern >>>>>>>>>>> SSD's, I expect that you will do even better. In general, I'd say >>>>>>>>>>> it was >>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that >>>>>>>>>>> you use >>>>>>>>>>> a good database, and provided that you maintain it properly. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Karl >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi >>>>>>>>>>>> >>>>>>>>>>>> I would like to know what is the maximum number of documents >>>>>>>>>>>> that you managed to crawl with ManifoldCF and with how many >>>>>>>>>>>> connectors in >>>>>>>>>>>> parallel it could works ? >>>>>>>>>>>> >>>>>>>>>>>> Thanks for your answer >>>>>>>>>>>> >>>>>>>>>>>> Baptiste >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
