Hi Lewis,

> The honest truth is that there needs to be comprehensive documentation on
> the wiki for the way that Nutch handles redirects. This is a question that
> has gone fully unanswered for some time.

That's true.

> In the meantime, can you advise if there is anything over
> and above the files in nutch-default.xml and o.a.n.protocol package which
> you would like to see documented?

I guess the poor documentation of Nutch/Hadoop is the biggest problem for beginners like me. I started with Nutch about 4-6 months ago (not full time, but several hours every week). At first I wrote some plugins (parser/indexer). This was a bit tricky because I had to learn directly from the source, since most of the tutorials and documents were outdated (<1.0) or simply wrong.

My crawler is now running and I need to scale it up. The current version runs in local mode, but that's not really fast. So I started to set up a Hadoop cluster (4 nodes) to run Nutch in deploy mode. This is where I am today, and my current questions are:

- I will buy some new hardware for the Hadoop cluster, but I'm not sure about the configuration. Is Nutch I/O or CPU heavy?
  http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

- What is the difference between protocol-httpclient and protocol-http? Just SSL and authentication? What about performance?

- What is a good value for each of the following configuration parameters?
  - fetcher.threads.fetch
  - fetcher.threads.per.queue
  - mapred.tasktracker.map.tasks.maximum
  - mapred.tasktracker.reduce.tasks.maximum
  - mapred.map.tasks
  - mapred.reduce.tasks

  My current hardware is a 4-node cluster; each node has dual CPUs (quad-core Xeon), 32GB RAM and 2*2TB SATA HDDs. I know it's impossible to define the "always right" value. But a rule of thumb to use as a starting value would be a great thing and would save me a lot of trial-and-error investigation.

- What's the difference between fetcher.threads.fetch from the configuration and the -threads option of the crawl command?

- Is it possible to follow external links only on 301 redirects?

- What happens if a page is marked as db_redir_temp / db_redir_perm? Is it refetched after db.fetch.interval.default?

I found loads of tutorials, and all of them have the "same" content: only the very basics (how to do your first crawl). I guess comprehensive documentation would be a big step for the amazing Nutch/Hadoop project.

Thanks in advance,
Rafael.

> Thanks
>
> On Wed, Nov 16, 2011 at 7:17 PM, Rafael Pappert <[email protected]> wrote:
>
>> Hello List,
>>
>> is it possible to follow http 301 redirects immediately?
>>
>> I tried to set http.redirect.max to 3 but the page is
>> still not indexed. readdb is still showing 1 page is
>> unfetched / db_redir_perm. And I can't find the
>> redirection target in the crawldb.
>>
>> How does nutch handle redirects?
>>
>> Thanks in advance,
>> Rafael.
>
> --
> *Lewis*

