Hi Lewis,
> 
> The honest truth is that there needs to be comprehensive documentation on
> the wiki for the way that Nutch handles redirects. This is a question that
> has gone fully unanswered for some time.

That's true.

>  In the meantime, can you advise if there is anything over
> and above the files in nutch-default.xml and o.a.n.protocol package which
> you would like to see documented?

I guess the poor documentation of Nutch/Hadoop is the biggest problem for
beginners like me. I started with Nutch about 4-6 months ago (not full time,
but several hours every week). At first I wrote some plugins
(parser/indexer). This was a bit tricky because I had to learn directly from
the source, since most of the tutorials/documents were outdated (<1.0) or
simply wrong.

My crawler is now running and I need to scale it up. The current version
runs in local mode, but that's not really fast. So I started to set up a
Hadoop cluster (4 nodes) to run Nutch in deploy mode. This is where I am
today, and my current questions are:

- I will buy some new hardware for the Hadoop cluster, but I'm not sure about
the configuration. Is Nutch I/O or CPU heavy?

http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

- what is the difference between protocol-httpclient and protocol-http? Just
SSL and authentication? What about performance?

- what are good values for the following configuration parameters:
        - fetcher.threads.fetch
        - fetcher.threads.per.queue
        - mapred.tasktracker.map.tasks.maximum
        - mapred.tasktracker.reduce.tasks.maximum
        - mapred.map.tasks
        - mapred.reduce.tasks

        My current hardware is a 4-node cluster of dual-CPU (quad-core Xeon)
        machines with 32 GB RAM and 2 x 2 TB SATA HDDs.
        I know it's impossible to define an "always right" value, but a rule
        of thumb to use as a starting value would be a great help
        and would save me a lot of trial-and-error investigation.

- what's the difference between fetcher.threads.fetch in the configuration
and the -threads option of the crawl command?

- is it possible to follow external links only on 301 redirects?

- what happens if a page is marked as db_redir_temp / db_redir_perm?
        Is it refetched after db.fetch.interval.default?


I found loads of tutorials, and all of them have the "same" content: only the
very basics (how to do your first crawl). I guess comprehensive
documentation would be a big step for the amazing Nutch/Hadoop project.

Thanks in advance,
Rafael.


> 
> Thanks
> 
> On Wed, Nov 16, 2011 at 7:17 PM, Rafael Pappert <[email protected]> wrote:
> 
>> Hello List,
>> 
>> is it possible to follow http 301 redirects immediately?
>> 
>> I tried to set http.redirect.max to 3 but the page is
>> still not indexed. readdb is still showing 1 page is
>> unfetched / db_redir_perm. And I can't find the
>> redirection target in the crawldb.
>> 
>> How does nutch handle redirects?
>> 
>> Thanks in advance,
>> Rafael.
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> *Lewis*
