Hi Prateek,

you're right, there is no specific reducer used. But without a reduce step
the segment data isn't (re)partitioned and the data isn't sorted.
This was a strong requirement once Nutch was a complete search engine
and the "content" subdir of a segment was used as page cache.
Getting the content from a segment is fast if the segment is partitioned
in a predictable way (hash partitioning) and map files are used.
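To illustrate the idea (this is a simplified sketch, not actual Nutch code): with hash partitioning, a reader can compute which part file of a segment holds a given URL's record, instead of scanning all parts. The class and method names below are made up for the example; Hadoop's default HashPartitioner uses the same formula.

```java
// Sketch: why hash partitioning makes segment lookups predictable.
// A reader computes the partition number from the key alone and then
// opens only the matching part-NNNNN map file.
public class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner:
    // mask off the sign bit, then take the modulo.
    static int partitionFor(String url, int numPartitions) {
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int parts = 4;
        String url = "https://example.org/page";
        int p = partitionFor(url, parts);
        // The same URL always maps to the same partition,
        // so the content can be fetched from one map file directly.
        System.out.println("lookup in part-" + String.format("%05d", p));
    }
}
```

Combined with sorted map files inside each partition, this gives the fast random access the page cache relied on.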

Well, this isn't a strong requirement anymore, since Nutch uses Solr,
Elasticsearch or other index services. But a lot of code accessing
the segments still assumes map files. Removing the reduce step from
the fetcher would also mean a lot of work in the code and tools accessing
the segments, especially to ensure backward compatibility.

Have you tried to run the fetcher with
 fetcher.parse=true
 fetcher.store.content=false ?
This will save a lot of time: without the need to write the large
raw content, the reduce phase should be fast, only a small fraction
(5-10%) of the fetcher map phase.
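For reference, these two properties can be set in conf/nutch-site.xml (the values shown are the ones suggested above):

```xml
<!-- Parse documents during the fetch job -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
<!-- Don't store the raw fetched content in the segment -->
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```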

Best,
Sebastian


On 7/20/20 11:38 PM, prateek sachdeva wrote:
> Hi Guys,
> 
> As per Apache Nutch 1.16 Fetcher class implementation here -
> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
> this is a map only job. I don't see any reducer set in the Job. So my
> question is why not set job.setNumreduceTasks(0) and save the time by
> outputting directly to HDFS.
> 
> Regards
> Prateek
> 
