Re: Infinite loop bug in Nutch 0.9
George,

Try using Nutch 1.0 instead. I have tested your example with the SVN version and it did not run into the problem you described.

J.

2009/4/2 George Herlin
> Indeed I have... that's how I found out.
>
> My test case: crawl
>
> http://www.purdue.ca/research/research_clinical.asp
>
> with crawl-urlfilter and regex-urlfilter ending with
>
> #purdue
> +^http://www.purdue.ca/research/
> +^http://www.purdue.ca/pdf/
>
> # reject anything else
> -.
>
> The site is very small (which helped in diagnosis).
>
> Attached the beginning of a run log, just in case
>
> brgds
>
> George
Re: Infinite loop bug in Nutch 0.9
Indeed I have... that's how I found out.

My test case: crawl

http://www.purdue.ca/research/research_clinical.asp

with crawl-urlfilter and regex-urlfilter ending with

#purdue
+^http://www.purdue.ca/research/
+^http://www.purdue.ca/pdf/

# reject anything else
-.

The site is very small (which helped in diagnosis).

Attached the beginning of a run log, just in case

brgds

George

LOG
Resource not found: commons-logging.properties
Resource not found: META-INF/services/org.apache.commons.logging.LogFactory
Resource not found: log4j.xml
Resource found: log4j.properties
Resource found: hadoop-default.xml
Resource found: hadoop-site.xml
Resource found: nutch-default.xml
Resource found: nutch-site.xml
Resource not found: crawl-tool.xml
Injector: starting
Injector: crawlDb: crawl-www.purdue.ca-20090402110952/crawldb
Injector: urlDir: conf/purdueHttp
Injector: Converting injected urls to crawl db entries.
Resource not found: META-INF/services/javax.xml.transform.TransformerFactory
Resource not found: META-INF/services/com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager
Resource not found: com/sun/org/apache/xml/internal/serializer/XMLEntities_en.properties
Resource not found: com/sun/org/apache/xml/internal/serializer/XMLEntities_en_US.properties
Resource found: regex-normalize.xml
Resource found: regex-urlfilter.txt
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402110955
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402110955
Fetcher: threads: 1
Resource found: parse-plugins.xml
fetching http://www.purdue.ca/research/research_clinical.asp
Resource found: mime-types.xml
Resource not found: META-INF/services/org.apache.xerces.impl.Version
Resource found: www.purdue.ca.html.parser-conf.properties
Resource found: www.purdue.ca.resultslist.html.parser-conf.properties
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402110955]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111003
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111003
Fetcher: threads: 1
fetching http://www.purdue.ca/research/
fetching http://www.purdue.ca/research/research_ongoing.asp
fetching http://www.purdue.ca/research/research_quality.asp
fetching http://www.purdue.ca/research/research_completed.asp
fetching http://www.purdue.ca/research/research_contin.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402111003]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111024
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111024
Fetcher: threads: 1
fetching http://www.purdue.ca/research/research.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402111024]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111031
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fe
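For readers unfamiliar with Nutch's +/- rule files: the filter rules in the message above are evaluated top to bottom, and the first pattern that matches decides whether a URL is accepted (+) or rejected (-). The sketch below is a re-implementation of that first-match-wins idea for illustration only; it is not Nutch's actual RegexURLFilter code, and the class and method names are made up.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    // Rules in file order: pattern, mapped to accept (true) or reject (false).
    // The first matching rule decides.
    static final Map<Pattern, Boolean> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("^http://www\\.purdue\\.ca/research/"), true);
        RULES.put(Pattern.compile("^http://www\\.purdue\\.ca/pdf/"), true);
        RULES.put(Pattern.compile("."), false); // "reject anything else"
    }

    static boolean accept(String url) {
        for (Map.Entry<Pattern, Boolean> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(url).find()) {
                return rule.getValue();
            }
        }
        return false; // no rule matched: reject here (the real filter's default may differ)
    }

    public static void main(String[] args) {
        System.out.println(accept("http://www.purdue.ca/research/research_clinical.asp")); // true
        System.out.println(accept("http://www.purdue.ca/about/")); // false, caught by -.
    }
}

Note the ordering consequence: the catch-all "-." only works as intended because it comes last; placed first, it would reject everything.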
Re: Infinite loop bug in Nutch 0.9
On Wed, Apr 1, 2009 at 13:29, George Herlin wrote:
> Sorry, forgot to say, there is an added precondition to causing the bug:
>
> The redirection has to be fetched before the page it redirects to... if
> not, there will be a pre-existing crawl datum with a reasonable
> refetch-interval.

Maybe this is something that was fixed between 0.9 and 1.0, but I think CrawlDbReducer fixes these datums, around line 147 (case CrawlDatum.STATUS_LINKED). Have you ever got stuck in an infinite loop because of it?

--
Doğacan Güney
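A sketch of the kind of repair being described: if CrawlDbReducer indeed fixes these datums in its STATUS_LINKED case, the effect would be roughly the following. This is a guess at the behaviour, not the actual 1.0 code; the constant values and names below are hypothetical.

public class LinkedDatumFixSketch {
    // Placeholder value; the real CrawlDatum.STATUS_LINKED constant differs.
    static final int STATUS_LINKED = 4;
    // Stand-in for the configured db.default.fetch.interval (in days).
    static final float DEFAULT_INTERVAL_DAYS = 30.0f;

    // If a datum that entered the db as STATUS_LINKED carries no positive
    // refetch interval, substitute the default, so the Generator's
    // last-fetch + refetch-interval > now test can eventually hold.
    static float normalizedInterval(int status, float intervalDays) {
        return (status == STATUS_LINKED && intervalDays <= 0.0f)
                ? DEFAULT_INTERVAL_DAYS
                : intervalDays;
    }
}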
Re: Infinite loop bug in Nutch 0.9
Sorry, forgot to say, there is an added precondition to causing the bug:

The redirection has to be fetched before the page it redirects to... if not, there will be a pre-existing crawl datum with a reasonable refetch-interval.

2009/4/1 George Herlin
> Hello, there.
>
> I believe I may have found an infinite loop in Nutch 0.9.
> [...]
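To illustrate this ordering precondition: in a crawl db update, an existing entry for the target URL keeps its interval, so the empty datum only becomes authoritative when the redirect is seen first. The sketch below uses made-up names and a deliberately simplified merge rule, not Nutch's actual CrawlDbReducer logic.

public class MergeOrderSketch {

    // Minimal stand-in for a crawl datum; only the refetch interval matters here.
    static class Datum {
        final float intervalDays;
        Datum(float intervalDays) { this.intervalDays = intervalDays; }
        public String toString() { return "Datum(interval=" + intervalDays + "d)"; }
    }

    // Simplified merge rule: an existing db entry for a URL wins over the
    // empty datum emitted when that URL is discovered as a redirect target.
    static Datum merge(Datum existing, Datum emittedForRedirectTarget) {
        return existing != null ? existing : emittedForRedirectTarget;
    }

    public static void main(String[] args) {
        Datum empty = new Datum(0.0f);  // as produced by `new CrawlDatum()`
        Datum sane = new Datum(30.0f);  // pre-existing entry, fetched earlier

        // Target fetched before the redirect: the sane interval survives.
        System.out.println(merge(sane, empty));  // Datum(interval=30.0d)

        // Redirect fetched first: nothing to merge against, so the empty
        // datum with interval 0.0 enters the db: the bug's precondition.
        System.out.println(merge(null, empty));  // Datum(interval=0.0d)
    }
}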
Infinite loop bug in Nutch 0.9
Hello, there.

I believe I may have found an infinite loop in Nutch 0.9.

It happens when a site has a page that refers to itself through a redirection.

The code in Fetcher.run(), around line 200 - sorry, my Fetcher has been a little modified, so line numbers may vary a little - says, for that case:

output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);

What that does is insert an extra (empty) crawl datum for the new URL, with a refetch interval of 0.0.

However (see Generator.Selector.map(), particularly lines 144-145), the non-refetch condition used seems to be last-fetch + refetch-interval > now... which is always false if refetch-interval == 0.0!

Now, if there is a new link to the new URL in that page, that crawl datum is re-used, and the whole thing loops indefinitely.

I've fixed that for myself by replacing the quoted line (it occurs twice) with:

output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null, CrawlDatum.STATUS_LINKED);

and that works. (By the way, the 30f should really be the value of "db.default.fetch.interval", but I haven't the time right now to work out the issues; in reality, if I am right in analysing the algorithm, the default constructor and the appropriate updater method should always enforce a positive refetch interval.)

Of course, another method could be used to remove this self-reference, but that could be complicated, as the self-reference may happen through a loop (2 or more pages etc... you know what I mean).

Has that been fixed already, and by what method?

Best regards

George Herlin
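To make the failure mode concrete, here is a minimal, self-contained simulation of the generate/fetch cycle described above. The classes and fields are illustrative stand-ins, not Nutch's real CrawlDatum or Generator code; the only thing carried over from the report is the due-for-fetch test, the negation of last-fetch + refetch-interval > now.

import java.util.LinkedHashMap;
import java.util.Map;

public class RefetchLoopSketch {

    // Hypothetical stand-in for a crawl datum: when the URL was last
    // fetched, and how long to wait before fetching it again (in days,
    // as Nutch measures it).
    static class Datum {
        long lastFetchMillis;  // 0 = never fetched
        final float intervalDays;
        Datum(float intervalDays) { this.intervalDays = intervalDays; }
    }

    // A URL is due unless last-fetch + refetch-interval > now. With an
    // interval of 0.0 the "unless" part can never hold, so an
    // already-fetched page stays eligible forever.
    static boolean dueForFetch(Datum d, long nowMillis) {
        long intervalMillis = (long) (d.intervalDays * 24L * 3600L * 1000L);
        return d.lastFetchMillis + intervalMillis <= nowMillis;
    }

    public static void main(String[] args) {
        Map<String, Datum> crawlDb = new LinkedHashMap<>();
        // Empty datum for a redirect target, as from `new CrawlDatum()`:
        crawlDb.put("http://example.com/loop", new Datum(0.0f));
        // Datum carrying the patched default of 30 days:
        crawlDb.put("http://example.com/ok", new Datum(30.0f));

        long now = System.currentTimeMillis();
        for (int cycle = 1; cycle <= 3; cycle++) {
            for (Map.Entry<String, Datum> e : crawlDb.entrySet()) {
                if (dueForFetch(e.getValue(), now)) {
                    System.out.println("cycle " + cycle + ": fetching " + e.getKey());
                    e.getValue().lastFetchMillis = now;  // mark as fetched
                }
            }
            now += 60_000;  // next generate/fetch cycle, a minute later
        }
        // Prints /loop on every cycle but /ok only on cycle 1: the
        // zero-interval datum is selected again and again.
    }
}

Running it shows exactly the pattern in the posted log: the same handful of URLs reappear in segment after segment, because their datums never acquire a positive interval.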