Hi Paul,

Could you try this query on your database, please, and tell me if it executes promptly:

SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t

I vaguely remember that I had to change the form of this query in order to support MySQL -- but first let's see if this helps.

Karl
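A quick sketch of why the subquery form matters, assuming a psql session against the MCF database: in the plain count form used by the UI (quoted further down the thread), the LIMIT applies to the single row the aggregate returns, so the whole table is still scanned; in the subquery form the LIMIT bounds the rows fed into the count, so the scan can stop after 10001 rows.

-- Plain EXPLAIN shows where the Limit node sits in each plan; add ANALYZE for
-- real timings (which actually runs the statements).
EXPLAIN SELECT COUNT(dochash) AS doccount FROM jobqueue LIMIT 1000001;
EXPLAIN SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t;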
On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <[email protected]> wrote:

> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
> (branches/CONNECTORS-1027) for looking at any changes we make for
> large-scale Postgresql optimization work.
>
> Please note that trunk code already has schema changes relative to MCF
> 1.7, so you will not be able to work directly with this branch code. I'll
> have to create patches for whatever changes you would need to try.
>
> Thanks,
> Karl
>
> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <[email protected]>
> wrote:
>
>> Hi,
>>
>> We're on Postgres 9.2. I'll get the query plans and add them to the doc.
>>
>> Thanks
>>
>> Paul
>>
>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Paul,
>>>
>>> Can you include the logged plan for this query? This is an actual query
>>> encountered during crawling:
>>>
>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a long-running
>>> query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1,
>>> jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>
>>> These queries are all from the UI; this is what gets generated when no
>>> limits are in place:
>>>
>>> WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>> FROM jobqueue t1 GROUP BY jobid]
>>> WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>> FROM jobqueue t1 GROUP BY jobid]
>>>
>>> This query is from the UI with a limit of 1000000:
>>>
>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>> jobqueue t1 LIMIT 1000001]
>>>
>>> I honestly don't understand why PostgreSQL would execute a sequential
>>> scan of the entire table when given a LIMIT clause. It certainly never
>>> used to do that. If you have any other suggestions, please let me know.
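A rough sketch of how the plan for the carrydown join above could be captured directly in psql on PostgreSQL 9.2; the literal values below are purely hypothetical stand-ins for the ? bind parameters, and ANALYZE actually executes the query:

EXPLAIN (ANALYZE, BUFFERS)
SELECT t0.id, t0.dochash, t0.docid
FROM carrydown t1, jobqueue t0
WHERE t1.jobid = 1407144048075      -- hypothetical job id
  AND t1.parentidhash = 'ABC123'    -- hypothetical parent hash
  AND t0.dochash = t1.childidhash
  AND t0.jobid = t1.jobid
  AND t1.isnew = 'B';               -- hypothetical isnew flag value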
>>> Some queries show up in this list because MCF periodically reindexes
>>> tables. For example, this query goes only against the (small) jobs table;
>>> its poor performance on occasion is likely due to something else happening
>>> to the database, probably a reindex:
>>>
>>> WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running
>>> query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR
>>> UPDATE]
>>>
>>> The final query is the document stuffing query, which is perhaps the
>>> most critical query in the whole system:
>>>
>>> SELECT
>>>   t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>   t0.priorityset
>>> FROM jobqueue t0
>>> WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime <= 1407246846166
>>> AND EXISTS (
>>>   SELECT 'x' FROM jobs t1
>>>   WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5
>>> )
>>> AND NOT EXISTS (
>>>   SELECT 'x' FROM jobqueue t2
>>>   WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d')
>>>   AND t2.jobid!=t0.jobid
>>> )
>>> AND NOT EXISTS (
>>>   SELECT 'x' FROM prereqevents t3,events t4
>>>   WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>> )
>>> ORDER BY t0.docpriority ASC
>>> LIMIT 480;
>>>
>>> Your analysis of whether IN beats OR does not agree with experiments I
>>> did on PostgreSQL 8.x, which showed no difference. What PostgreSQL version
>>> are you using? Also, I trust you have query plans that demonstrate your
>>> claim? In any case, whether IN or OR is generated is a function of the
>>> MCF database driver, so this is trivial to experiment with. I'll create a
>>> ticket and a branch for experimentation.
>>>
>>> Thanks,
>>> Karl
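One way the IN-versus-OR question could be checked quickly, as a sketch only (the status values are taken from the stuffing query above; add ANALYZE for actual timings):

EXPLAIN SELECT id FROM jobqueue
WHERE status IN ('P','G') AND checkaction='R' LIMIT 480;

EXPLAIN SELECT id FROM jobqueue
WHERE (status='P' OR status='G') AND checkaction='R' LIMIT 480;

If both plans pick the same index with comparable costs, that would match the earlier observation of no difference.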
>>>
>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <[email protected]>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> Changing maxstatuscount to something much smaller (10,000) doesn't seem
>>>> to buy us much on the table scan - in the attached you'll see that it's
>>>> still taking a long time to return the job status page. Also attached are
>>>> some other sample long-running queries that we're beginning to see more
>>>> frequently. There's also an example of a query that's frequently executed
>>>> and regularly takes > 4 secs (plus a suggested change to improve
>>>> performance). This one in particular would certainly benefit from a move
>>>> to SSDs, which should relieve the I/O-bound bottleneck on Postgres.
>>>>
>>>> We're loading the system from 10 million documents towards 100 million,
>>>> so we'd be keen to work with you to optimise where possible.
>>>>
>>>> Thanks,
>>>>
>>>> Paul
>>>>
>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> The jobstatus query that uses count(*) should be doing something like
>>>>> this when the maxstatuscount value is set:
>>>>>
>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>
>>>>> This will still do a sequential scan, but it will be an aborted one,
>>>>> so you can control the maximum amount of time spent doing the query.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>> count(*)-ing, but I'll certainly have another look to see if we've
>>>>>> missed something.
>>>>>>
>>>>>> We're increasingly seeing long-running threads and I'll put together
>>>>>> some samples. As an example, on a job that's currently aborting:
>>>>>>
>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running
>>>>>> query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL
>>>>>> WHERE jobid=?]
>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Parameter 0: '1.000000001E9'
>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Parameter 1: '1407144048075'
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: Update on
>>>>>> jobqueue (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan:   ->  Bitmap Heap
>>>>>> Scan on jobqueue (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan:     Recheck Cond:
>>>>>> (jobid = 1407144048075::bigint)
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan:     ->  Bitmap
>>>>>> Index Scan on i1392985450177 (cost=0.00..18614.85 rows=764916 width=0)
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan:       Index Cond:
>>>>>> (jobid = 1407144048075::bigint)
>>>>>> WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) - Stats: n_distinct=4.0
>>>>>> most_common_vals={G,C,Z,P}
>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>> WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>
>>>>>> Paul
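One experiment that could be tried against a copy of the data, as a sketch only and not how MCF issues this update today: break the priority reset into batches keyed off the jobqueue primary key, so that no single statement has to rewrite all ~765k rows of the job in one transaction. The literal values are the logged parameters above:

UPDATE jobqueue
SET docpriority = 1.000000001E9, priorityset = NULL
WHERE id IN (
  SELECT id FROM jobqueue
  WHERE jobid = 1407144048075
    AND (docpriority IS DISTINCT FROM 1.000000001E9 OR priorityset IS NOT NULL)
  LIMIT 10000
);
-- repeat until the statement reports 0 rows updated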
>>>>>>
>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> For the jobqueue scans from the UI, there is a parameter you can set
>>>>>>> which limits the number of documents counted to at most a specified
>>>>>>> amount. This uses a LIMIT clause, which should prevent unbounded time
>>>>>>> being spent on these kinds of queries:
>>>>>>>
>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>
>>>>>>> The documentation says that the default value for this parameter is
>>>>>>> 10000, but that is incorrect; the actual default is 500000. You could
>>>>>>> set it lower for better UI performance (losing some information, of
>>>>>>> course).
>>>>>>>
>>>>>>> As for long-running queries, a lot of time and effort has been spent
>>>>>>> in MCF to ensure that this doesn't happen. Specifically, the main
>>>>>>> document queuing query is structured to read directly out of a specific
>>>>>>> jobqueue index. This is the crucial query that must work properly for
>>>>>>> scalability, since a query that is effectively just a sort over the
>>>>>>> entire jobqueue would be a major problem. There are times when
>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because
>>>>>>> it makes a huge distinction between whether there are zero of something
>>>>>>> or one of something, but you can work around that particular issue by
>>>>>>> setting the analyze count to 1 if you start to see this problem -- which
>>>>>>> basically means that reanalysis of the table has to occur on every
>>>>>>> stuffing query.
>>>>>>>
>>>>>>> I'd appreciate seeing the queries that are long-running in your case,
>>>>>>> so that I can see whether that is what you are encountering.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> We're beginning to see issues with a document count > 10 million.
>>>>>>>> At that point, even with good Postgres vacuuming, the jobqueue table
>>>>>>>> is starting to become a bottleneck.
>>>>>>>>
>>>>>>>> For example, select count(*) from jobqueue, which is executed when
>>>>>>>> querying job status, will do a full table scan of jobqueue, which has
>>>>>>>> more than 10 million rows. That's going to take some time in Postgres.
>>>>>>>>
>>>>>>>> SSDs will certainly make a big difference to document processing
>>>>>>>> throughput (which we see is largely I/O-bound in Postgres), but we are
>>>>>>>> increasingly seeing long-running queries in the logs. Our current
>>>>>>>> thinking is that we'll need to refactor JobQueue somewhat to optimise
>>>>>>>> queries and potentially partition jobqueue into a set of tables (a
>>>>>>>> table per queue, for example).
>>>>>>>>
>>>>>>>> Paul
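Where an exact figure isn't needed for a status display, one rough alternative to a full count(*) scan, as a sketch only and only as fresh as the table's last VACUUM/ANALYZE, is the planner's own row estimate:

SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'jobqueue';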
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Baptiste,
>>>>>>>>>
>>>>>>>>> ManifoldCF is not limited by the number of agent processes or
>>>>>>>>> parallel connectors. Overall database performance is the limiting
>>>>>>>>> factor.
>>>>>>>>>
>>>>>>>>> I would read this:
>>>>>>>>>
>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>
>>>>>>>>> Also, there's a section in ManifoldCF in Action (I believe Chapter 2)
>>>>>>>>> that discusses this issue.
>>>>>>>>>
>>>>>>>>> Some five years ago, I successfully crawled 5 million web documents
>>>>>>>>> using Postgresql 8.3. Postgresql 9.x is faster, and with modern SSDs
>>>>>>>>> I expect that you will do even better. In general, I'd say it is fine
>>>>>>>>> to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>> a good database and maintain it properly.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I would like to know the maximum number of documents that you have
>>>>>>>>>> managed to crawl with ManifoldCF, and how many connectors it can run
>>>>>>>>>> in parallel.
>>>>>>>>>>
>>>>>>>>>> Thanks for your answer,
>>>>>>>>>>
>>>>>>>>>> Baptiste
