Hi Prateek, you're right: there is no explicit reducer, but without a reduce step the segment data isn't (re)partitioned and isn't sorted. This was a hard requirement back when Nutch was a complete search engine and the "content" subdirectory of a segment served as the page cache: retrieving content from a segment is fast if the segment is partitioned in a predictable way (hash partitioning) and map files are used.
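As a rough illustration (this is not Nutch code, just the arithmetic of Hadoop's standard HashPartitioner; the class and method names are made up), a reader can compute which part file of a hash-partitioned segment holds a given URL key, so a content lookup touches exactly one map file:

```java
// Sketch: locating the MapFile part for a URL in a hash-partitioned segment.
// Mirrors Hadoop's HashPartitioner; names are illustrative, not Nutch APIs.
public class SegmentPartition {

    // Same formula HashPartitioner uses: mask off the sign bit of the
    // hash, then take it modulo the number of partitions (reduce tasks).
    static int partitionFor(String urlKey, int numPartitions) {
        return (urlKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // With e.g. 10 reduce tasks, the content for this URL always lands
        // in the same part-xxxxx map file, so a reader can go straight to it.
        String url = "https://example.org/";
        int part = partitionFor(url, 10);
        System.out.println("part-" + String.format("%05d", part));
    }
}
```

Without the reduce step there is no such guarantee: a URL could be in any mapper's output file, and every lookup would have to scan all of them.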
Well, this isn't a hard requirement anymore, since Nutch now indexes into Solr, Elasticsearch, or other search services. But a lot of code accessing the segments still assumes map files. Removing the reduce step from the fetcher would therefore mean a lot of work on the code and tools that access segments, especially to ensure backward compatibility.

Have you tried running the fetcher with fetcher.parse=true and fetcher.store.content=false? This saves a lot of time, and without the need to write the large raw content the reduce phase should be fast, only a small fraction (5-10%) of the fetcher's map phase.

Best,
Sebastian

On 7/20/20 11:38 PM, prateek sachdeva wrote:
> Hi Guys,
>
> As per the Apache Nutch 1.16 Fetcher class implementation here -
> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
> this is a map-only job. I don't see any reducer set in the Job. So my
> question is why not set job.setNumReduceTasks(0) and save the time by
> outputting directly to HDFS.
>
> Regards
> Prateek
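For reference, the two properties I mention can be set in conf/nutch-site.xml; a minimal fragment (property names as they appear in nutch-default.xml):

```xml
<!-- Parse documents during the fetch step instead of a separate parse job -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
<!-- Skip writing the raw fetched content into the segment -->
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```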

