Hi Rafael,

The page we are talking about will be linked from the page below

http://wiki.apache.org/nutch/InternalDocumentation

and will be available here

http://wiki.apache.org/nutch/RedirectHandling


> I guess the poor documentation of nutch/hadoop is the biggest problem for
> beginners like me. I started with nutch ~4-6 months ago (not full time, but
> several hours every week). At first I wrote some plugins (parser/indexer).
> This was a bit tricky because I had to learn directly from the source,
> because most of the tutorials/documents were outdated (<1.0) or simply wrong.
>

Please note that we are trying to remove as much duplicated documentation
regarding Nutch & Hadoop as possible. The Nutch wiki has been updated
recently and this is ongoing work, so hopefully we can improve it further in
the near future. As Nutch focuses purely on web crawling, the Hadoop
material can be viewed directly in the Hadoop wiki. I've added a link to
it on the Nutch Hadoop Tutorial page of our wiki.


> My crawler is now running and I need to scale it up. The current version
> runs in local mode but that's not really fast. So I started to set up a
> hadoop cluster (4 nodes) to run nutch in deploy mode. This is where I am
> today, and my current questions are:
>
> - I will buy some new hardware for the hadoop cluster, but I'm not sure about
> the configuration. Is nutch I/O or CPU heavy?
>

I have not heard of anyone blowing gaskets or anything similar on a brand
new hardware configuration. If there is something wrong, it can usually be
fixed by improving the configuration.


>
> - what is the difference between protocol-httpclient and protocol-http?
> Just SSL and authentication? What about performance?
>

protocol-httpclient is broken; please see the Jira issue that has been
filed. You will also need to have a look at the code for this, as I am by no
means an expert on protocol-httpclient.

>
> - what is a good value for the following configuration parameter:
>        - fetcher.threads.fetch
>        - fetcher.threads.per.queue
>        - mapred.tasktracker.map.tasks.maximum
>        - mapred.tasktracker.reduce.tasks.maximum
>        - mapred.map.tasks
>        - mapred.reduce.tasks
>

Impossible to say; this varies significantly with the crawl, the network,
the nature of the crawl data, etc. You simply need to experiment and read as
much existing documentation as possible. Sorry about this one.
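
If it helps with the experimenting, here is a rough Python sketch (not
something that ships with Nutch) that renders a set of trial values as a
Hadoop-style <configuration> fragment, the format used by files such as
conf/nutch-site.xml. The helper name render_properties and the values are
made up for illustration; the values are placeholders, not recommendations.

```python
import xml.etree.ElementTree as ET


def render_properties(props):
    """Render a dict as a Hadoop-style <configuration> XML fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder values only, NOT recommendations: change one setting at a
# time and compare fetch throughput between runs.
trial = {
    "fetcher.threads.fetch": 10,
    "fetcher.threads.per.queue": 1,
}
fragment = render_properties(trial)
print(fragment)
```

The printed fragment can be pasted into the relevant site file by hand for
each trial run.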

>
>        My current hardware is a 4 Node Cluster of dual CPU (quad core
> xeon), 32GB RAM, 2*2TB SATA HDD.
>        I know it's impossible to define the "always right" value. But a
> rule of thumb, to use as a start value, would be a very great thing
>        and would save me a lot of "trial-and-error" investigation.
>

Unfortunately, this is open source software you are using. Maybe Cloudera or
some of the other commercially motivated experts can help you with this
stuff. This is outwith my experience. Try here:
http://wiki.apache.org/nutch/Support


> - what's the difference between fetcher.threads.fetch from the configuration
> and the -threads option from the crawl command?
>
This depends on how you wish to monitor/schedule your Nutch crawls. As you
know, running individual commands gives you more flexibility/control over
how Nutch does the work for you.

>
> - is it possible to follow external links only on 301 redirects?
>
I haven't got a clue, but I will definitely include this type of material in
the wiki page I created above. Maybe you can do a bit of investigation and
help me out when I get round to writing up on this stuff.


>
> - what is happening if a page is marked as db_redir_temp / db_redir_perm?
>        Refetch after db.fetch.interval.default?
>
Again, we will need to work together to get our heads around this. If you
have a look at the code, then maybe we can get something written up in due
course.

Sorry about the vague answers; however, it's a pretty large task to answer
everything fully considering there are ~5-10 questions in all. I'm sure
there must be some material in the user@ archives, so please have a look
there as well.

hth

Lewis
