Re: Infinite loop bug in Nutch 0.9
On Wed, Apr 1, 2009 at 13:29, George Herlin ghher...@gmail.com wrote: Sorry, forgot to say, there is an added precondition to causing the bug: the redirection has to be fetched before the page it redirects to... if not, there will be a pre-existing crawl datum with a reasonable refetch interval.

Maybe this is something fixed between 0.9 and 1.0, but I think CrawlDbReducer fixes these datums, around line 147 (case CrawlDatum.STATUS_LINKED). Have you ever got stuck in an infinite loop because of it?

2009/4/1 George Herlin ghher...@gmail.com

Hello, there. I believe I may have found an infinite loop in Nutch 0.9. It happens when a site has a page that refers to itself through a redirection. The code in Fetcher.run(), around line 200 (sorry, my Fetcher has been modified a little, so line numbers may vary) says, for that case:

output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);

What that does is insert an extra (empty) crawl datum for the new URL, with a refetch interval of 0.0. However (see Generator.Selector.map(), particularly lines 144-145), the don't-refetch condition used seems to be last-fetch + refetch-interval > now ... which is always false if refetch-interval == 0.0! Now, if there is a new link to the new URL in that page, that crawl datum is re-used, and the whole thing loops indefinitely.

I've fixed that for myself by changing the quoted line (twice) to:

output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null, CrawlDatum.STATUS_LINKED);

and that works (btw, the 30f should really be the value of db.default.fetch.interval, but I haven't the time right now to work out the issues; in reality, if I am right in analysing the algorithm, the default constructor and the appropriate updater method should always enforce a positive refetch interval). Of course, another method could be used to remove this self-reference, but that could be complicated, as the self-reference may happen through a loop (2 or more pages etc..., you know what I mean).
Has that been fixed already, and by what method? Best regards, George Herlin

-- Doğacan Güney
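The due-for-fetch arithmetic George describes can be sketched in a few lines. The following is a self-contained toy, not actual Nutch code: the isDue helper is hypothetical and merely mirrors the condition he quotes from Generator.Selector.map(), to show why a 0.0 refetch interval makes an entry eligible on every generate cycle.

```java
// Toy model of the generator's due-for-fetch test (names invented here).
// With a refetch interval of 0 the entry is *always* due, so a
// self-redirecting page keeps getting re-generated, looping forever.
public class DueCheck {

    // Mirrors "last-fetch + refetch-interval > now" from the report:
    // the URL is skipped only while its next fetch time is in the future.
    static boolean isDue(long lastFetchMillis, float intervalDays, long nowMillis) {
        long intervalMillis = (long) (intervalDays * 24L * 60L * 60L * 1000L);
        return lastFetchMillis + intervalMillis <= nowMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Fetched one second ago with a 30-day interval: not due yet.
        System.out.println(isDue(now - 1000L, 30f, now)); // false
        // Fetched one second ago with a 0.0 interval: due again at once.
        System.out.println(isDue(now - 1000L, 0f, now));  // true
    }
}
```

This is also why George's fix of passing 30f (or better, db.default.fetch.interval) into the CrawlDatum constructor breaks the loop: the entry stops being due immediately after every fetch.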
Re: Infinite loop bug in Nutch 0.9
Indeed I have... that's how I found out. My test case: crawl http://www.purdue.ca/research/research_clinical.asp with crawl-urlfilter and regex-urlfilter ending with:

#purdue
+^http://www.purdue.ca/research/
+^http://www.purdue.ca/pdf/
# reject anything else
-.

The site is very small (which helped in diagnosis). Attached the beginning of a run log, just in case. brgds George

LOG
Resource not found: commons-logging.properties
Resource not found: META-INF/services/org.apache.commons.logging.LogFactory
Resource not found: log4j.xml
Resource found: log4j.properties
Resource found: hadoop-default.xml
Resource found: hadoop-site.xml
Resource found: nutch-default.xml
Resource found: nutch-site.xml
Resource not found: crawl-tool.xml
Injector: starting
Injector: crawlDb: crawl-www.purdue.ca-20090402110952/crawldb
Injector: urlDir: conf/purdueHttp
Injector: Converting injected urls to crawl db entries.
Resource not found: META-INF/services/javax.xml.transform.TransformerFactory
Resource not found: META-INF/services/com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager
Resource not found: com/sun/org/apache/xml/internal/serializer/XMLEntities_en.properties
Resource not found: com/sun/org/apache/xml/internal/serializer/XMLEntities_en_US.properties
Resource found: regex-normalize.xml
Resource found: regex-urlfilter.txt
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402110955
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402110955
Fetcher: threads: 1
Resource found: parse-plugins.xml
fetching http://www.purdue.ca/research/research_clinical.asp
Resource found: mime-types.xml
Resource not found: META-INF/services/org.apache.xerces.impl.Version
Resource found: www.purdue.ca.html.parser-conf.properties
Resource found: www.purdue.ca.resultslist.html.parser-conf.properties
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402110955]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111003
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111003
Fetcher: threads: 1
fetching http://www.purdue.ca/research/
fetching http://www.purdue.ca/research/research_ongoing.asp
fetching http://www.purdue.ca/research/research_quality.asp
fetching http://www.purdue.ca/research/research_completed.asp
fetching http://www.purdue.ca/research/research_contin.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402111003]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111024
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111024
Fetcher: threads: 1
fetching http://www.purdue.ca/research/research.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402111024]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111031
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694942#action_12694942 ] Julien Nioche commented on NUTCH-692:

As I pointed out in my previous message, the root of the problem in my case was related to some dodgy URLs coming from the Javascript parser, which put the basic normalizer into a spin. This would repeat in subsequent attempts indeed. However, the AlreadyBeingCreatedException should not happen and we should not have output files left open. If your patch fixes that, I am sure that this will be a very welcome contribution.

AlreadyBeingCreatedException with Hadoop 0.19
Key: NUTCH-692
URL: https://issues.apache.org/jira/browse/NUTCH-692
Project: Nutch
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Julien Nioche

I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedExceptions during the reduce phase of a parse. For some reason one of my tasks crashed, and then I ran into this AlreadyBeingCreatedException when other nodes tried to pick it up. There was recently a discussion on the Hadoop user list on similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do if the problems persist with 0.19. I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use it for Nutch 1.0? I will be running a crawl on a super large cluster in the next couple of weeks and will confirm this issue. J.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694986#action_12694986 ] Doğacan Güney commented on NUTCH-721:

I've committed the Nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher?

Fetcher2 Slow
Key: NUTCH-721
URL: https://issues.apache.org/jira/browse/NUTCH-721
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Environment: Fedora Core r6, Kernel 2.6.22-14, jdk1.6.0_12
Reporter: Roger Dunk
Attachments: crawl_generate.tar.gz, nutch-site.xml

Fetcher2 fetches far more slowly than Fetcher1. Config options:

fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0
generate.max.per.host = 1

With a queue size of ~40,000, the result is: activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0, with maybe a download of 1 page per second. Running with -noParse makes little difference. CPU load average is around 0.2; with Fetcher1, CPU load is around 2.0 - 3.0. Hosts already cached by a local caching NS appear to download quickly upon a re-fetch, so this is possibly an issue relating to NS lookups; however, all things being equal, Fetcher1 runs fast without pre-caching hosts.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694986#action_12694986 ] Doğacan Güney edited comment on NUTCH-721 at 4/2/09 6:01 AM:

I've committed the Nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher, so that we can find out if this is related to the new fetcher or is the side effect of some other change?

was (Author: dogacan): I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher?
Nutch Topical / Focused Crawl
Hi @ all, I'd like to turn Nutch into a focused / topical crawler. It's part of my final year thesis. Further, I'd like others to be able to build on my work. I started to analyze the code and think that I found the right piece of code; I just wanted to know if I am on the right track. I think the right place to implement a decision about whether to fetch further is in the output method of the Fetcher class, every time we call the collect method of the OutputCollector object:

private ParseStatus output(Text key, CrawlDatum datum, Content content, ProtocolStatus pstatus, int status) { ... output.collect(...); ... }

Would you mind letting me know the best way to turn this decision into a plugin? I was thinking of going a similar way to the scoring filters. Thanks in advance. Cheers, MyD
Re: Nutch Topical / Focused Crawl
Hi @ all, I'd like to turn Nutch into a focused / topical crawler. It's part of my final year thesis. Further, I'd like others to be able to build on my work. I started to analyze the code and think that I found the right piece of code; I just wanted to know if I am on the right track. I think the right place to implement a decision about whether to fetch further is in the output method of the Fetcher class, every time we call the collect method of the OutputCollector object: private ParseStatus output(Text key, CrawlDatum datum, Content content, ProtocolStatus pstatus, int status) { ... output.collect(...); ... } Would you mind letting me know the best way to turn this decision into a plugin? I was thinking of going a similar way to the scoring filters. Thanks in advance.

Don't have the code in front of me right now, but we did something like this for a focused tech-pages crawl with Krugle a few years back. Our goal was to influence the OPIC scores to ensure that pages we thought were likely to be good technical pages got fetched sooner. Assuming you're using the scoring-opic plugin, you'd create a custom ScoringFilter that gets executed after the scoring-opic plugin. But the actual process of hooking everything up was pretty complicated and error-prone, unfortunately. We had to define our own keys for storing our custom scores inside the parse_data Metadata, the content Metadata, and the CrawlDB Metadata. And we had to implement the following methods for our scoring plugin:

setConf()
injectScore()
initialScore()
generateSortValue()
passScoreBeforeParsing()
passScoreAfterParsing()
shouldHarvestOutlinks()
distributeScoreToOutlink()
updateDbScore()
indexerScore()

-- Ken

-- Ken Krugler +1 530-210-6378
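The shape of what Ken describes can be sketched without any Nutch dependencies. The class, method, and metadata-key names below are all invented for illustration; a real plugin would implement Nutch's ScoringFilter interface instead, with the topic score computed in the pass-score-after-parsing step and folded back in when the generator sorts candidates.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of topic-biased scoring: rate a page's parsed
// text against some topic terms, stash the result in page metadata,
// and later combine it with the base (e.g. OPIC) score so on-topic
// pages sort earlier in the generator.
public class TopicScoring {
    static final String TOPIC_KEY = "topic.score"; // assumed metadata key

    // Parse-time step: crude relevance = fraction of topic terms present.
    static float topicScore(String pageText, String[] topicTerms) {
        int hits = 0;
        String lower = pageText.toLowerCase();
        for (String t : topicTerms) {
            if (lower.contains(t.toLowerCase())) hits++;
        }
        return (float) hits / topicTerms.length; // 0.0 .. 1.0
    }

    // Generate-time step: boost the base score by the stored topic score.
    static float sortValue(float baseScore, Map<String, String> meta) {
        float topic = Float.parseFloat(meta.getOrDefault(TOPIC_KEY, "0"));
        return baseScore * (1.0f + topic);
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        float topic = topicScore("Lucene indexing and search tips",
                                 new String[] {"lucene", "search"});
        meta.put(TOPIC_KEY, Float.toString(topic));
        System.out.println(sortValue(1.0f, meta)); // boosted above plain 1.0
    }
}
```

The substring match is only a stand-in; a thesis-grade focused crawler would swap in a trained classifier at the same point, which is exactly why keeping the score in metadata (rather than hard-coding it in the Fetcher) pays off.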
Re: Infinite loop bug in Nutch 0.9
George, Try using Nutch-1.0 instead. I have tested your example with the SVN version and it did not get into the problem you described. J.

2009/4/2 George Herlin ghher...@gmail.com

Indeed I have... that's how I found out. My test case: crawl http://www.purdue.ca/research/research_clinical.asp with crawl-urlfilter and regex-urlfilter ending with #purdue +^http://www.purdue.ca/research/ +^http://www.purdue.ca/pdf/ # reject anything else -. The site is very small (which helped in diagnosis).
Using keywords metatags
Hi all. I would like to add keywords to the information that gets inserted into the Lucene Indexes. I am thinking I need to insert them into the WebDB and later on insert them into the Lucene indexes. Am I right? Which extension points do I need to use? Thanks in advance -- Rodrigo Reyes
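One half of Rodrigo's task, pulling the keywords out of the page, can be shown in miniature. In Nutch this would live in an HtmlParseFilter (to put the keywords into the parse metadata) plus an indexing filter to copy them into the Lucene document; the regex below is only a self-contained stand-in for the real DOM-based parser, and the class name is invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy extraction of the content of <meta name="keywords" ...>.
// A real parse filter would walk the DOM the HTML parser already built
// rather than run a regex over raw markup.
public class MetaKeywords {
    private static final Pattern KEYWORDS = Pattern.compile(
        "<meta\\s+name=[\"']keywords[\"']\\s+content=[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    static String extract(String html) {
        Matcher m = KEYWORDS.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"keywords\" content=\"nutch, lucene, crawler\">"
            + "</head></html>";
        System.out.println(extract(html)); // nutch, lucene, crawler
    }
}
```

The other half, getting the extracted string into the index, is the indexing-filter extension point: add the value as a field on the document there, and it ends up in the Lucene index without touching the db at all.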
[jira] Updated: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cosmin Lehene updated NUTCH-692:

Attachment: NUTCH-692.patch

This just checks for the destination file's existence before attempting to create a new output MapFile for the reduce task in FetcherOutputFormat and ParseOutputFormat. If the destination files exist, it deletes them. The AlreadyBeingCreatedException is thrown when a retried task attempts to create the same file that a previous failed task left behind.
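The delete-before-create idea in the patch can be illustrated with plain local files standing in for HDFS paths (the real patch works on Hadoop FileSystem objects inside the output formats; the class and method names here are invented):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration of making output creation idempotent: before a retried
// reduce task creates its output file, remove whatever a previous
// failed attempt left behind, so the create cannot collide with a
// half-written file (the local-file analogue of the
// AlreadyBeingCreatedException scenario).
public class IdempotentOutput {

    static void createOutput(Path out) throws IOException {
        Files.deleteIfExists(out); // clear leftovers from a failed attempt
        Files.createFile(out);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("nutch692").resolve("part-00000");
        createOutput(out); // first (failed) attempt's file
        createOutput(out); // retried attempt succeeds instead of throwing
        System.out.println(Files.exists(out)); // true
    }
}
```

On HDFS the collision is with an open lease rather than a mere existing file, which is why the exception only showed up once Hadoop 0.19's lease handling got stricter, but the recovery step is the same: delete the stale output before creating it again.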
[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695122#action_12695122 ] Doğacan Güney commented on NUTCH-692:

Thanks for the patch. It looks good to me. Can you confirm whether this fixes the problem (or tell me how to trigger the problem without the patch)?
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695170#action_12695170 ] Roger Dunk commented on NUTCH-721:

For the following tests I've used the same segment containing 5000 URLs. I cleaned the named cache before the first two tests.

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.OldFetcher newcrawl/segments/20090402130655/
real 3m38.084s
user 2m20.887s
sys 0m7.470s

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/
[...]
Fetcher: done
real 53m44.800s
user 2m20.070s
sys 0m9.527s

For this next test, I used the same segment but didn't clear the named cache from the previous test, so all resolvable hosts should still be cached. This appeared to help greatly, as oftentimes, out of 80 active threads, only 60 were spinwaiting (as opposed to 79 in the non-cached test), but there were still plenty of times where at least 30 consecutive log entries showed 80 threads spinwaiting. And clearly, as can be seen from the times below, still nowhere in the league of OldFetcher.

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/
[...]
Aborting with 80 hung threads.
Fetcher: done
real 22m5.420s
user 2m39.407s
sys 0m8.192s
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695233#action_12695233 ] Hudson commented on NUTCH-721:

Integrated in Nutch-trunk #772 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/772/]) - Commit old fetcher as OldFetcher for now so that we can test Fetcher2 performance.