Hi Paul,

The jobstatus query that uses count(*) should be doing something like this when the maxdocumentstatuscount value is set:

select count(*) from jobqueue where xxx limit 500001

This will still do a sequential scan, but it will be an aborted one, so you can control the maximum amount of time spent doing the query.

Karl

On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <[email protected]> wrote:

> Hi,
>
> We've had a play with maxstatuscount and couldn't stop it from
> count(*)-ing but I'll certainly have another look to see if we've missed
> something.
>
> We're increasingly seeing long running threads and I'll put together some
> samples. As an example, on a job that's currently aborting:
>
> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE jobid=?]
> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Parameter 0: '1.000000001E9'
> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Parameter 1: '1407144048075'
> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: Update on jobqueue (cost=18806.08..445770.39 rows=764916 width=287)
> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: -> Bitmap Heap Scan on jobqueue (cost=18806.08..445770.39 rows=764916 width=287)
> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: Recheck Cond: (jobid = 1407144048075::bigint)
> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: -> Bitmap Index Scan on i1392985450177 (cost=0.00..18614.85 rows=764916 width=0)
> WARN 2014-09-10 18:37:29,960 (Job reset thread) - Plan: Index Cond: (jobid = 1407144048075::bigint)
> WARN 2014-09-10 18:37:29,960 (Job reset thread) -
> WARN 2014-09-10 18:37:30,140 (Job reset thread) - Stats: n_distinct=4.0 most_common_vals={G,C,Z,P} most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
> WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>
> Paul
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what?
> Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Paul,
>>
>> For the jobqueue scans from the UI, there is a parameter you can set
>> which limits the number of documents counted to at most a specified
>> amount. This uses a limit clause, which should prevent unbounded time
>> doing these kinds of queries:
>>
>> org.apache.manifoldcf.ui.maxstatuscount
>>
>> The documentation says that the default value for this parameter is
>> 10000, which however is incorrect. The actual true default is 500000. You
>> could set that lower for better UI performance (losing some information, of
>> course.)
>>
>> As for long-running queries, a lot of time and effort has been spent in
>> MCF to insure that this doesn't happen. Specifically, the main document
>> queuing query is structured to read directly out of a specific jobqueue
>> index. This is the crucial query that must work properly for scalability,
>> since doing a query that is effectively just a sort on the entire jobqueue
>> would be a major problem.
>> There are some times where Postgresql's
>> optimizer fails to do the right thing here, mostly because it makes a huge
>> distinction between whether there's zero of something or one of something,
>> but you can work around that particular issue by setting the analyze count
>> to 1 if you start to see this problem -- which basically means that
>> reanalysis of the table has to occur on every stuffing query.
>>
>> I'd appreciate seeing the queries that are long-running in your case so
>> that I can see if that is what you are encountering or not.
>>
>> Thanks,
>> Karl
>>
>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <[email protected]> wrote:
>>
>>> Hi Karl,
>>>
>>> We're beginning to see issues with a document count > 10 million. At
>>> that point, even with good postgres vacuuming the jobqueue table is
>>> starting to become a bottleneck.
>>>
>>> For example select count(*) from jobqueue, which is executed when
>>> querying job status will do a full table scan of jobqueue which has
>>> more than 10 million rows. That's going to take some time in postgres.
>>>
>>> SSDs will certainly make a big difference to document processing
>>> through-put (which we see is largely I/O bound in postgres) but we are
>>> increasingly seeing long running queries in the logs. Our current thinking
>>> is that we'll need to refactor JobQueue somewhat to optimise queries
>>> and, potentially partition jobqueue into a subset of tables (table per
>>> queue for example).
>>>
>>> Paul
>>>
>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Baptiste,
>>>>
>>>> ManifoldCF is not limited by the number of agents processes or parallel
>>>> connectors. Overall database performance is the limiting factor.
>>>>
>>>> I would read this:
>>>>
>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>
>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>> discusses this issue.
>>>>
>>>> Some five years ago, I successfully crawled 5 million web documents,
>>>> using Postgresql 8.3. Postgresql 9.x is faster, and with modern SSD's, I
>>>> expect that you will do even better. In general, I'd say it was fine to
>>>> shoot for 10M - 100M documents on ManifoldCF, provided that you use a good
>>>> database, and provided that you maintain it properly.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <[email protected]> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I would like to know what is the maximum number of documents that you
>>>>> managed to crawl with ManifoldCF and with how many connectors in parallel
>>>>> it could works ?
>>>>>
>>>>> Thanks for your answer
>>>>>
>>>>> Baptiste
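
For anyone reading this thread later, here is a minimal sketch of the capped count discussed at the top. It is an illustration only, not ManifoldCF's actual SQL. It assumes a toy jobqueue table with a status column, uses SQLite in place of PostgreSQL so that it runs anywhere, and the function names (capped_count, uncapped_count, make_demo_db) are made up for the example. One point it tries to show: taken literally, a LIMIT on the outer count(*) applies to the single result row rather than to the rows scanned, so the sketch assumes the limit sits inside a subquery.

```python
import sqlite3


def capped_count(conn, status, cap):
    """Count matching rows, but stop once cap + 1 of them have been seen."""
    # The LIMIT is inside the subquery, so at most cap + 1 rows are produced
    # before the outer count(*) runs over them.
    row = conn.execute(
        "SELECT count(*) FROM "
        "(SELECT 1 FROM jobqueue WHERE status = ? LIMIT ?) AS capped",
        (status, cap + 1),
    ).fetchone()
    return row[0]


def uncapped_count(conn, status, cap):
    """Same query with LIMIT on the outer aggregate, for comparison."""
    # Here LIMIT only restricts the one-row aggregate result, so every
    # matching row is still counted.
    row = conn.execute(
        "SELECT count(*) FROM jobqueue WHERE status = ? LIMIT ?",
        (status, cap + 1),
    ).fetchone()
    return row[0]


def make_demo_db(rows):
    """Build an in-memory stand-in for jobqueue (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobqueue (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO jobqueue (status) VALUES (?)", [("G",)] * rows
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db(1000)
    print("capped:", capped_count(conn, "G", 500))
    print("uncapped:", uncapped_count(conn, "G", 500))
```

A result larger than the cap can be read as "more than cap documents", which is the information the UI gives up in exchange for a bounded query time.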
