Here is my understanding:
Nutch uses MapReduce everywhere. Looking at the source code of Nutch,
even in 1.x (1.8 in this case), just in the nutch/crawl folder, here are the
files that import hadoop.mapred:
$ grep 'hadoop.mapred' * | awk 'BEGIN{FS=":"}{print $1}' | sort | uniq
CrawlDb.java
CrawlDbFilter.java
CrawlDbMerger.java
CrawlDbReader.java
CrawlDbReducer.java
DeduplicationJob.java
Generator.java
Injector.java
LinkDb.java
LinkDbFilter.java
LinkDbMerger.java
LinkDbReader.java
URLPartitioner.java
And for example, in CrawlDb.java, the code looks like this:
public void update(Path crawlDb, ...) throws IOException {
  FileSystem fs = FileSystem.get(getConf());
  ...
  JobConf job = CrawlDb.createJob(getConf(), crawlDb);
  ...
Based on my understanding, it is reading the Hadoop system configuration
and telling the job, in effect, "here are all the nodes you can use...".
There is also a reducer in that job, CrawlDbReducer, whose Javadoc describes
its purpose as "Merge new page entries with existing entries."
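To make that wiring concrete, here is a rough sketch of what a createJob-style
method looks like with the old mapred API. The mapper/reducer class names are
taken from the file list above, but the body is my own approximation under
those assumptions, not a verbatim copy of the Nutch source:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDbFilter;
import org.apache.nutch.crawl.CrawlDbReducer;

public class CrawlDbJobSketch {
  // Sketch of the pattern: read the existing crawldb (plus new segments),
  // map through CrawlDbFilter, then reduce through CrawlDbReducer to merge
  // new page entries with the existing ones.
  public static JobConf createJob(Configuration config, Path crawlDb, Path output)
      throws IOException {
    JobConf job = new JobConf(config);          // picks up the Hadoop cluster settings
    job.setJobName("crawldb " + crawlDb);

    // "current" as the live db subdirectory is an assumption in this sketch
    FileInputFormat.addInputPath(job, new Path(crawlDb, "current"));
    job.setInputFormat(SequenceFileInputFormat.class);

    job.setMapperClass(CrawlDbFilter.class);    // URL filtering/normalization (map side)
    job.setReducerClass(CrawlDbReducer.class);  // merge new entries with existing entries

    FileOutputFormat.setOutputPath(job, output);
    job.setOutputFormat(MapFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    return job;
  }
}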
In conclusion, there are several steps, and they are all implemented with
MapReduce.
Correct me if I'm wrong.
Bin
On Tue, Apr 29, 2014 at 8:57 AM, S.L <[email protected]> wrote:
> Hi All,
>
> I am running Nutch on a single-node Hadoop cluster. I do not use an
> indexing URL, and I have disabled the LinkInversion phase as I do not need
> any scores to be attached to any URL.
>
> My question is: if the LinkInversion phase in Nutch is the only phase that
> requires the Reduce task to be run, then since I have disabled it in the
> Crawl.java class, can I go ahead and set the number of reduce tasks in the
> Hadoop job submission to zero, or is there some other phase that still
> requires a reduce task?
>
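For what it's worth on the question above: with the old mapred API, asking for
a map-only job is just a matter of setting the reduce count to zero on the
JobConf, as in the minimal sketch below (my own illustration, not Nutch code).
But going by the classes above, phases like the CrawlDb update do depend on
their reducer (CrawlDbReducer), so zero reduce tasks only seems safe for jobs
that are genuinely map-only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class MapOnlySketch {
  public static JobConf mapOnlyJob(Configuration conf) {
    JobConf job = new JobConf(conf);
    // With zero reduce tasks the job becomes map-only: map output goes
    // straight to the output format and no reducer is ever run.
    job.setNumReduceTasks(0);
    return job;
  }
}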