And, sorry, about the database size -- much of that is likely going into your history table. You can limit the amount of history stored, or disable history entirely, by means of a configuration parameter. Have a look at the "how-to-build-and-deploy" page.
Karl

On Wed, May 21, 2014 at 4:31 PM, Tom Rees <[email protected]> wrote:

> Dear ManifoldCF:
>
> First, I would like to report that switching to ManifoldCF 1.6 solved a problem I encountered with version 1.4.1: whenever I ran two web crawls simultaneously, the two crawls would stop progressing within half an hour. The 1.6 version works beautifully. Thank you for the excellent work.
>
> Now I have a couple of issues with the database that I would appreciate your feedback on. First, the two crawls that I mentioned finished and pulled down a little over 255,000 documents. The size of the postgres (version 9.3.2) database on disk, however, expanded to a little over 8 GB, and this is after running a full vacuum. This seems like a lot of space for two medium-sized crawls. Is there a way to get the web crawler to use less database space?
>
> Secondly, when I ran two simultaneous web crawls with the NULL output connector, the crawls worked without issue. When I ran the same two simultaneous web crawls with a custom output connector that wrote the files to a local file system, everything worked fine. However, when I used an output connector that wrote the downloaded files to a file system and put the path to each file on an ActiveMQ JMS queue, the crawl showed quirky behavior. A few times the crawls stopped in their tracks, and then after 40 - 60 minutes a message was printed to the logfile saying that the SQL queries took too long. The full dump of one set of these messages is below, at the end of this email. The web crawls always recover, and they are still running. I am using postgres 9.3.2 with ManifoldCF, and so far it has not had many issues, except for the occasional "SQL taking too long" message, although these are infrequent. Do I need to use a different version of postgres? Or make some other change?
>
> Thank you for your help.
>
> Tom Rees
> Chiliad
>
> WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Found a long-running query (2662579 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
> WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Parameter 0: 'D'
> WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Parameter 1: '-1'
> WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Parameter 2: '1400623413113'
> WARN 2014-05-21 11:05:08,231 (Worker thread '28') - Parameter 3: 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'
> WARN 2014-05-21 11:05:08,231 (Worker thread '28') - Parameter 4: 'B'
> WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Found a long-running query (2625296 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
> WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Parameter 0: 'D'
> WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Parameter 1: '-1'
> WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Parameter 2: '1400623413113'
> WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Parameter 3: 'D942516DE5623A6417FCB994186B507E8CDA30D6'
> WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Parameter 4: 'B'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Found a long-running query (2675765 ms): [SELECT parentidhash FROM intrinsiclink WHERE jobid=? AND parentidhash IN (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) AND linktype=? AND childidhash=? FOR UPDATE]
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 0: '1400623413113'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 1: '054FC31ACF6FB96D2F8D19FF9CC230349E6A7A76'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 2: '0774E538282FCA04F0FF95AC65D48EFC57CC6225'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 3: '1027C9AF07AE2B419C31A1D3B20352E31867BBBB'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 4: '1382DE9902A7CCC0012F043077E1739867CE00A4'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 5: '2E8844A26FCD3096DF0D6BC3BB3D6648FCBCA7FA'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 6: '34741F8B2706BCB202FDA72DABB94D916D497CD4'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 7: '6A5E47B467A29A8614B473856F1D28EC8B30F5F3'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 8: '71B865B0979B351279EFD9F99CA8AF700704400A'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 9: '77C6E57EBDD811027F776BF895E0B43275AF3628'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 10: '8267055C5CE6D7A1917F88B1FA310FC5082FD599'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 11: '8F361A3EDA0CAC989812623441DA02BD42883C4F'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 12: '956CCECF3FD5F508624E19270FD5EC28532B0922'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 13: '9BAA3731F101B3908E4FFF4A5325601C57B4CD57'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 14: 'AD628D16A2708EECD1C33AA0E63D849BCB5DF417'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 15: 'B661E6DD08FD89A6643A706ECAB6E1729FC623C8'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 16: 'D1F182BF5B49CB4FBF274A1B63B54C2F684EC059'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 17: 'D7FB0CB3AFE34BC258686368296AF0D896C5786E'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 18: 'D807BE55355A53CA84B4163F42081A896B323A81'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 19: 'EDED88E796389DEB5E8DA14F1FD56088CDA8BF98'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 20: 'FE4A24472BD3648F839FFAB7B5476915504A9755'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 21: 'link'
> WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Parameter 22: 'B661E6DD08FD89A6643A706ECAB6E1729FC623C8'
> WARN 2014-05-21 11:05:08,289 (Worker thread '4') - Plan: Update on hopcount (cost=157.53..165.57 rows=1 width=81)
> WARN 2014-05-21 11:05:08,289 (Worker thread '28') - Plan: Update on hopcount (cost=157.53..165.57 rows=1 width=81)
> WARN 2014-05-21 11:05:08,289 (Worker thread '28') - Plan: -> Nested Loop (cost=157.53..165.57 rows=1 width=81)
> WARN 2014-05-21 11:05:08,289 (Worker thread '4') - Plan: -> Nested Loop (cost=157.53..165.57 rows=1 width=81)
> WARN 2014-05-21 11:05:08,289 (Worker thread '28') - Plan: -> HashAggregate (cost=157.11..157.12 rows=1 width=20)
> WARN 2014-05-21 11:05:08,290 (Worker thread '28') - Plan: -> Hash Join (cost=101.51..157.11 rows=1 width=20)
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') - Plan: -> HashAggregate (cost=157.11..157.12 rows=1 width=20)
> WARN 2014-05-21 11:05:08,290 (Worker thread '28') - Plan: Hash Cond: (((t0.linktype)::text = (t1.linktype)::text) AND ((t0.parentidhash)::text = (t1.parentidhash)::text))
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') - Plan: -> Hash Join (cost=101.51..157.11 rows=1 width=20)
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') - Plan: Hash Cond: (((t0.linktype)::text = (t1.linktype)::text) AND ((t0.parentidhash)::text = (t1.parentidhash)::text))
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') - Plan: -> Index Scan using i1400371486543 on hopdeletedeps t0 (cost=0.56..55.95 rows=27 width=109)
> WARN 2014-05-21 11:05:08,290 (Worker thread '28') - Plan: -> Index Scan using i1400371486543 on hopdeletedeps t0 (cost=0.56..55.95 rows=27 width=109)
> WARN 2014-05-21 11:05:08,290 (Worker thread '28') - Plan: Index Cond: ((jobid = 1400623413113::bigint) AND ((childidhash)::text = 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'::text))
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') - Plan: Index Cond: ((jobid = 1400623413113::bigint) AND ((childidhash)::text = 'D942516DE5623A6417FCB994186B507E8CDA30D6'::text))
> WARN 2014-05-21 11:05:08,290 (Worker thread '28') - Plan: -> Hash (cost=100.32..100.32 rows=42 width=101)
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') - Plan: -> Hash (cost=100.32..100.32 rows=42 width=101)
> WARN 2014-05-21 11:05:08,290 (Worker thread '28') - Plan: -> Index Scan using i1400371486547 on intrinsiclink t1 (cost=0.56..100.32 rows=42 width=101)
> WARN 2014-05-21 11:05:08,290 (Worker thread '28') - Plan: Index Cond: ((jobid = 1400623413113::bigint) AND ((childidhash)::text = 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'::text) AND (isnew = 'B'::bpchar))
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') - Plan: -> Index Scan using i1400371486547 on intrinsiclink t1 (cost=0.56..100.32 rows=42 width=101)
> WARN 2014-05-21 11:05:08,290 (Worker thread '28') - Plan: -> Index Scan using hopcount_pkey on hopcount (cost=0.42..8.45 rows=1 width=69)
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') - Plan: Index Cond: ((jobid = 1400623413113::bigint) AND ((childidhash)::text = 'D942516DE5623A6417FCB994186B507E8CDA30D6'::text) AND (isnew = 'B'::bpchar))
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') - Plan: -> Index Scan using hopcount_pkey on hopcount (cost=0.42..8.45 rows=1 width=69)
> WARN 2014-05-21 11:05:08,290 (Worker thread '28') - Plan: Index Cond: (id = t0.ownerid)
> WARN 2014-05-21 11:05:08,290 (Worker thread '28') -
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') - Plan: Index Cond: (id = t0.ownerid)
> WARN 2014-05-21 11:05:08,290 (Worker thread '4') -
> WARN 2014-05-21 11:05:08,294 (Worker thread '40') - Plan: LockRows (cost=0.56..101.40 rows=3 width=47) (actual time=0.041..0.041 rows=0 loops=1)
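
For triaging warnings like the ones quoted in this message, a small script can pull out the worker thread and the duration of each long-running query. This is only a sketch and not part of ManifoldCF: the function name `long_running_queries` and the regular expression are assumptions, based solely on the log format shown in this thread. The pattern also tolerates the "> " quote markers that mail clients insert when a log line is wrapped.

```python
import re

# Assumed pattern, derived only from the WARN lines quoted in this thread.
# [\s>]+ lets the match survive a mail-client line wrap ("long-running\n> query").
LONG_QUERY = re.compile(
    r"\(Worker thread '(?P<thread>\d+)'\) - "
    r"Found a long-running[\s>]+query \((?P<ms>\d+) ms\)"
)


def long_running_queries(log_text):
    """Return a list of (thread, minutes) for each long-running query warning.

    `thread` is the worker thread label as a string; `minutes` is the reported
    duration converted from milliseconds and rounded to one decimal place.
    """
    results = []
    for match in LONG_QUERY.finditer(log_text):
        minutes = round(int(match.group("ms")) / 60000.0, 1)
        results.append((match.group("thread"), minutes))
    return results


if __name__ == "__main__":
    import sys

    # Usage: python long_queries.py manifoldcf.log  (the file name is hypothetical)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for thread, minutes in long_running_queries(handle.read()):
            print(f"worker thread {thread}: {minutes} min")
```

Run against the dump above, this reports durations in minutes rather than milliseconds, which makes it easier to compare them with the 40 - 60 minute stalls described in the message.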
