Here is my understanding:

Nutch uses MapReduce everywhere. Looking at the source code of Nutch, even
1.x (1.8 in this case), and just in the nutch/crawl folder, here are the
files that import hadoop.mapred:

$ grep 'hadoop.mapred' * | awk 'BEGIN{FS=":"}{print $1}' | sort | uniq
CrawlDb.java
CrawlDbFilter.java
CrawlDbMerger.java
CrawlDbReader.java
CrawlDbReducer.java
DeduplicationJob.java
Generator.java
Injector.java
LinkDb.java
LinkDbFilter.java
LinkDbMerger.java
LinkDbReader.java
URLPartitioner.java

And for example, in CrawlDb.java, the code looks like this:

 public void update(Path crawlDb, ...) throws IOException {

    FileSystem fs = FileSystem.get(getConf());

    ...

    JobConf job = CrawlDb.createJob(getConf(), crawlDb);

    ...

Based on my understanding, it reads the Hadoop configuration via getConf()
and hands it to the JobConf, which is effectively telling the job which
cluster, and therefore which nodes, it can use.
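
To make that concrete, here is a rough sketch of what such a
createJob-style method looks like with the old org.apache.hadoop.mapred
API. This is not the actual Nutch source: the input/output paths, formats
and the simplified wiring are my own assumptions; only the class names
(CrawlDbFilter, CrawlDbReducer, CrawlDatum) come from the files listed
above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDbFilter;
import org.apache.nutch.crawl.CrawlDbReducer;

public class CrawlDbJobSketch {

  public static JobConf createJob(Configuration config, Path crawlDb) {
    // The JobConf is built from the cluster configuration (core-site.xml,
    // mapred-site.xml, ...), so the job inherits the filesystem and
    // JobTracker addresses. That is all it needs to know about the cluster;
    // Hadoop itself decides which nodes end up running the tasks.
    JobConf job = new JobConf(config);
    job.setJobName("crawldb " + crawlDb);

    // Input/output layout below is a simplified assumption, not Nutch's.
    job.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(crawlDb, "current"));

    job.setMapperClass(CrawlDbFilter.class);    // filter/normalize URLs
    job.setReducerClass(CrawlDbReducer.class);  // merge new and existing entries

    job.setOutputFormat(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(crawlDb, "new"));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    return job;
  }

  public static void main(String[] args) throws Exception {
    // Submit to whatever cluster the configuration on the classpath describes.
    JobConf job = createJob(new Configuration(), new Path(args[0]));
    JobClient.runJob(job);
  }
}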

And also, that job has a reducer, CrawlDbReducer, whose Javadoc describes
its purpose as "Merge new page entries with existing entries."
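
Just to show the shape of such a reducer in the old mapred API, here is a
stripped-down sketch. It is not the real CrawlDbReducer, which does much
more careful status and score resolution; the "keep the newest entry" rule
below is only a placeholder merge rule of my own.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class MergeSketchReducer extends MapReduceBase
    implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {

  private final CrawlDatum result = new CrawlDatum();

  public void reduce(Text url, Iterator<CrawlDatum> values,
                     OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    // All existing and newly generated entries for one URL arrive together;
    // merge them into a single record for the updated CrawlDb.
    boolean found = false;
    while (values.hasNext()) {
      CrawlDatum datum = values.next();
      if (!found || datum.getFetchTime() > result.getFetchTime()) {
        result.set(datum);  // copy, because Hadoop reuses the iterator's object
        found = true;
      }
    }
    if (found) {
      output.collect(url, result);
    }
  }
}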


In conclusion, the crawl consists of several steps, and they are all
implemented as MapReduce jobs.
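
Which, I think, also bears on the original question about zeroing out the
reduce tasks: in the old mapred API that would be job.setNumReduceTasks(0),
which turns the job into a map-only job, so any configured reducer simply
never runs. A phase like the CrawlDb update depends on its reducer, so it
could not be run that way. A trivial illustration (my own snippet, not
Nutch code):

import org.apache.hadoop.mapred.JobConf;

public class MapOnlyExample {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // With zero reduce tasks the map output goes straight to the output
    // format; the reduce phase (and any reducer class set on the job) is
    // skipped entirely.
    job.setNumReduceTasks(0);
  }
}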


Correct me if I am wrong.


Bin

On Tue, Apr 29, 2014 at 8:57 AM, S.L <[email protected]> wrote:

> Hi All,
>
> I am running Nutch on a single-node Hadoop cluster. I do not use an
> indexing URL, and I have disabled the LinkInversion phase as I do not need
> any scores to be attached to any URL.
>
> My question: if the LinkInversion phase is the only phase in Nutch that
> requires a Reduce task to be run, then since I have disabled it in the
> Crawl.java class, can I go ahead and set the number of reduce tasks in the
> Hadoop job submission to zero, or is there any other phase that still
> requires a reduce task?
>
