Hi,

Great feedback, suggestions, and activity on this list.
Based on guidance from the list I stopped using the bin/nutch crawl command and am now calling each step individually. Julien, you suggested that I start with https://issues.apache.org/jira/secure/attachment/12535851/NUTCH-1087-2.1.patch. I'm more comfortable working with Ruby than shell scripts, so I ported the script to Ruby and added some additional logging to help me better understand the timing and output of each step.

There are a few parameters used in the shell script that I'm unclear about -- what impact they have, or whether they are being used at all -- and I'd love feedback on what they mean and how I might tweak them.

Called during Generate & Fetch:
-------------------------
mapred.reduce.tasks.speculative.execution=false
mapred.map.tasks.speculative.execution=false
mapred.compress.map.output=true

Called during Parse:
-------------------------
mapred.skip.attempts.to.start.skipping=2
mapred.skip.map.max.skip.records=1

I've run 12 crawl iterations over the 177 websites that I'm crawling, and I'm wondering if the results are what others might expect.
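In case it helps to see what I mean by "timing each step", the Ruby port wraps each nutch command in a small timing helper, roughly like this (a minimal sketch; `timed` and `run_step` are my own helper names, not anything from Nutch):

```ruby
# Minimal sketch of the per-step timing wrapper in my Ruby port.
# The real script shells out to the nutch commands listed below.
def timed(label)
  start = Time.now
  result = yield
  elapsed = Time.now - start
  puts format("%-10s took %.1f s", label, elapsed)
  [result, elapsed]
end

# Each crawl step is just a shell-out to the corresponding nutch command:
def run_step(label, cmd)
  timed(label) { system(cmd) }
end

# e.g. run_step("generate", "nutch generate -topN 50000 -numFetchers 1 -noFilter")
```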
These are my crawling commands:
-------------------------
0) nutch inject #{options[:seed_dir]}

Loop:
1) nutch generate -D mapred.map.tasks=2 -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -numFetchers 1 -noFilter
2) nutch fetch -D mapred.map.tasks=2 -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true <BATCH_ID>
3) nutch parse -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 <BATCH_ID>
4) nutch updatedb

Iterations #2-#5 resulted in:
-------------------------
Average iteration time: 30-35 minutes

Iterations #6-#12 resulted in (realized I should be timing each step):
-------------------------
Average generate time: 250 seconds
Average fetch time: 400 seconds
Average parse time: 450 seconds
Average update time: 300 seconds
Average total iteration time: 20-25 minutes

HBase size after 12 iterations: 11.02GB

After the 12th iteration, readdb -stats resulted in the following output:
-------------------------
WebTable statistics start
Statistics for WebTable:
status 2 (status_fetched): 39611
min score: 0.0
retry 0: 43146
jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=7829, MAP_INPUT_RECORDS=45859, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=1339, MAP_OUTPUT_BYTES=2430527, COMMITTED_HEAP_BYTES=27249123328, COMBINE_INPUT_RECORDS=183436, SPLIT_RAW_BYTES=78062, REDUCE_INPUT_RECORDS=463, REDUCE_INPUT_GROUPS=118, COMBINE_OUTPUT_RECORDS=463, REDUCE_OUTPUT_RECORDS=118, MAP_OUTPUT_RECORDS=183436}, FileSystemCounters={FILE_BYTES_READ=32439253, FILE_BYTES_WRITTEN=32913783}, File Output Format Counters ={BYTES_WRITTEN=2520}}}}
retry 1: 2713
status 5 (status_redir_perm): 1373
max score: 19.345
TOTAL urls: 45859
status 4 (status_redir_temp): 346
status 1 (status_unfetched): 4529
avg score: 0.04870584
WebTable statistics: done

Some questions:
-------------------------
1) After 12 iterations I'm still seeing more than 4,500 documents out of ~45,000 that are unfetched. How might I go about determining why the unfetched URLs are not being fetched?

2) Any suggestions for modifying the iteration steps and/or parameters for each step in successive iterations to decrease crawl times and/or increase the number of fetched URLs? topN? threads?

3) Any additional information on what the mapred-related parameters do?
mapred.reduce.tasks.speculative.execution=false
mapred.map.tasks.speculative.execution=false
mapred.compress.map.output=true
mapred.skip.attempts.to.start.skipping=2
mapred.skip.map.max.skip.records=1

4) During my local, single-node crawl I've seen a few sites throw 500 errors and become unresponsive. How can I ensure that I'm not DOSing and crashing the sites I'm crawling? My current settings:
* fetcher.server.delay=5.0
* fetcher.threads.fetch=100
* fetcher.threads.per.queue=100
* fetcher.threads.per.host=100
* db.fetch.schedule.class=org.apache.nutch.crawl.AdaptiveFetchSchedule
* http.timeout=30000
* db.ignore.external.links=true

5) What value should I set for gora.buffer.read.limit? Currently it's set to the default of 10000. During fetch steps #6-#12, nearly 50% of the time was spent reading from HBase, and I was seeing gora.buffer.read.limit=10000 show up for several minutes in the logs.

Thanks,
Matt

On Fri, Sep 28, 2012 at 8:21 AM, Julien Nioche <[email protected]> wrote:
> Hi Matt
>
>>> the fetch step is likely to take most of the time and the time it takes
>>> it mostly a matter of the distribution of hosts/IP/domains in your
>>> fetchlist. Search the WIKI for details on performance tips
>>
>> Thanks. Most of the urls that I'm fetching are each on their own
>> IP/hosts and unique servers.
>
> Ok, you might want to use a large number of threads then
> (fetcher.threads.fetch)
>
> [...]
>
>>>> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
>>>
>>> redirections? sounds quite a lot though
>>
>> Thoughts for how I would identify which are redirects?
>
> try using 'nutch readdb' to dump the content of the webtable and inspect
> the URLs
>
> Julien
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
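P.S. For question 1, my current plan is to follow the readdb suggestion: dump the webtable to text (nutch readdb -dump <out_dir>) and tally which URLs are still status 1. A rough Ruby sketch of the tally -- the dump's field layout varies by Nutch version, so the patterns here are assumptions to adjust against a real dump:

```ruby
# Hypothetical parser for a `nutch readdb -dump <dir>` text dump.
# Assumes each record begins with its URL and contains a "status: N" line;
# adjust the regexes to the actual dump format of your Nutch version.
def unfetched_urls(dump_text)
  urls = []
  current = nil
  dump_text.each_line do |line|
    if line =~ %r{^(https?://\S+)}
      current = $1                   # start of a new record
    elsif line =~ /status:\s*1\b/    # 1 == status_unfetched
      urls << current if current
    end
  end
  urls
end
```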

