An answer to one of my own questions. I'd still love help with the others.
> Some questions:
> -------------------------
> 1) After 12 iterations I'm still seeing more than 4,500 documents out
> of 45,000 that are unfetched. How might I go about determining why the
> unfetched urls are not being fetched?
I ran bin/nutch readdb -stats -dump links and found that many of the
unfetched documents had hit 500 errors, 404s, or socket timeouts, or
simply sat at a depth greater than my iteration count ({dist=13} after
12 iterations). The 500 errors tie directly into my other question,
though, about making sure I'm not saturating a website and causing
those errors myself. I checked some of the URLs that reported 500
errors in the webpage table by hand, and they are returning 200
response codes now.
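For the record, the dump-and-tally I used to break those down (flags
and dump field names vary a little between Nutch versions; the crawl
id and output paths are placeholders):

    # dump the webpage table, then count records per fetch status
    bin/nutch readdb -dump readdb_out -crawlId my_crawl
    grep -h "status:" readdb_out/part-* | sort | uniq -c | sort -rn

That made it easy to separate real failures (500s, 404s, timeouts)
from pages that were simply never scheduled because of their depth.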
>
> 2) Any suggestions for modifying the iteration steps and/or
> parameters for each step in successive iterations to decrease crawl
> times and/or increase the number of fetched urls? topN? threads?
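For context, each of my iterations is the usual
generate/fetch/parse/updatedb cycle, roughly like the sketch below
(Nutch 2.x command names; exact flags differ between versions, and the
crawl id is a placeholder):

    bin/nutch generate -topN 5000 -crawlId my_crawl
    bin/nutch fetch -all -crawlId my_crawl -threads 100
    bin/nutch parse -all -crawlId my_crawl
    bin/nutch updatedb -crawlId my_crawl

so I'm mainly wondering which knobs (-topN, -threads, per-step
parameters) are worth varying between iterations.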
>
> 3) Any additional information on what the mapred related parameters do?
> mapred.reduce.tasks.speculative.execution=false
> mapred.map.tasks.speculative.execution=false
> mapred.compress.map.output=true
> mapred.skip.attempts.to.start.skipping=2
> mapred.skip.map.max.skip.records=1
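My current reading of these, for anyone who can confirm or correct it
(the comments reflect stock Hadoop behaviour as I understand it, not
anything Nutch-specific):

    # speculative execution launches duplicate attempts of slow tasks;
    # a duplicate fetch task would hit the same URLs twice, so both
    # are disabled
    mapred.map.tasks.speculative.execution=false
    mapred.reduce.tasks.speculative.execution=false

    # compress intermediate map output to cut shuffle disk/network IO
    mapred.compress.map.output=true

    # after 2 failed attempts of a task, enter skip mode; allow at
    # most 1 record to be skipped around each bad record
    mapred.skip.attempts.to.start.skipping=2
    mapred.skip.map.max.skip.records=1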
>
> 4) During my local, single node crawl I've seen a few sites throw 500
> errors and become unresponsive. How can I ensure that I'm not DoS-ing
> and crashing the sites I'm crawling?
> * fetcher.server.delay=5.0
> * fetcher.threads.fetch=100
> * fetcher.threads.per.queue=100
> * fetcher.threads.per.host=100
> * db.fetch.schedule.class=org.apache.nutch.crawl.AdaptiveFetchSchedule
> * http.timeout=30000
> * db.ignore.external.links=true
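One thing I'm unsure about in the list above: with
fetcher.threads.per.queue at 100, I don't think fetcher.server.delay
is honoured per request any more (if I read nutch-default.xml right,
fetcher.server.min.delay applies instead once per-queue threads exceed
1). A more conservative combination I'm considering, as a sketch
rather than a recommendation:

    fetcher.threads.fetch=100     # total threads across all queues
    fetcher.threads.per.queue=1   # one thread per host queue...
    fetcher.threads.per.host=1    # ...so one in-flight request per site
    fetcher.server.delay=5.0      # 5s pause between requests to a host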
>
> 5) What value should I set for gora.buffer.read.limit? Currently it's
> set to the default of 10000. During fetch steps #6-#12 nearly 50% of
> the time was spent reading from HBase. I was seeing
> gora.buffer.read.limit=10000 show up for several minutes in the logs.
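In case it helps anyone answer: I set it via conf/nutch-site.xml
(shown here as key=value for brevity), and my possibly-wrong
understanding is that it caps how many rows Gora buffers per read from
the backing store:

    gora.buffer.read.limit=10000  # rows buffered per read, as I read it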
>
> Thanks,
> Matt
>
> On Fri, Sep 28, 2012 at 8:21 AM, Julien Nioche
> <[email protected]> wrote:
>> Hi Matt
>>
>>
>>> > the fetch step is likely to take most of the time, and the time it
>>> > takes is mostly a matter of the distribution of hosts/IPs/domains
>>> > in your fetchlist. Search the wiki for details on performance tips.
>>>
>>> Thanks. Most of the URLs I'm fetching are each on their own
>>> IP/host, on unique servers.
>>>
>>
>> Ok, you might want to use a large number of threads then
>> (fetcher.threads.fetch)
>>
>> [...]
>>
>>
>>>
>>> >
>>> >
>>> >> * Why would HBase show 64,000 documents but ElasticSearch only 50,000?
>>> >>
>>> >
>>> > redirections? sounds quite a lot though
>>>
>>> Thoughts on how I would identify which are redirects?
>>>
>>
>> try using 'nutch readdb' to dump the content of the webtable and inspect
>> the URLs
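(Noting inline what I plan to run against that dump, assuming
redirects appear in the status field as redir_temp/redir_perm the way
they do in Nutch's status listings:

    bin/nutch readdb -dump redirs_out -crawlId my_crawl
    grep -h "redir" redirs_out/part-* | wc -l
)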
>>
>> Julien
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble