Hi Sebastian,

Thanks for your reply. A couple of questions:
1. We have customized the Apache Nutch jobs a bit, like this: we run a separate
parse job (ParseSegment.java) after the fetch job (Fetcher.java). So, as suggested
above, if I set fetcher.store.content=false, I am assuming the "content" folder
will not be created and hence our parse job won't work, because it takes the
content folder as its input. Also, we have added an additional step,
"avroConversion", which takes "parse_data", "parse_text", "content" and
"crawl_fetch" as input and converts them into a specific Avro schema defined by
us. So I think I will end up breaking a lot of things if I set
fetcher.store.content=false and do the parsing within the fetch phase itself
(fetcher.parse=true).

[image: image.png]

2. In your earlier email, you said "a lot of code accessing the segments still
assumes map files" - which code are you referring to? In my use case above, we
are not sending the crawled output to any indexers. In the avro conversion step,
we just convert the data into the Avro schema and dump it to HDFS. Do you think
we still need reducers in the fetch phase? FYI - I tried running with 0 reducers
and don't see any impact as such.

Appreciate your help.

Regards
Prateek

On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel <[email protected]> wrote:

> Hi Prateek,
>
> you're right, there is no specific reducer used, but without a reduce step
> the segment data isn't (re)partitioned and isn't sorted.
> This was a strong requirement once, when Nutch was a complete search engine
> and the "content" subdir of a segment was used as a page cache.
> Getting the content from a segment is fast if the segment is partitioned
> in a predictable way (hash partitioning) and map files are used.
>
> Well, this isn't a strong requirement anymore, since Nutch uses Solr,
> Elasticsearch or other index services. But a lot of code accessing
> the segments still assumes map files. Removing the reduce step from
> the fetcher would also mean a lot of work in code and tools accessing
> the segments, esp. to ensure backward compatibility.
>
> Have you tried to run the fetcher with
>   fetcher.parse=true
>   fetcher.store.content=false ?
> This will save a lot of time, and without the need to write the large
> raw content the reduce phase should be fast, only a small fraction
> (5-10%) of the fetcher map phase.
>
> Best,
> Sebastian
>
>
> On 7/20/20 11:38 PM, prateek sachdeva wrote:
> > Hi Guys,
> >
> > As per the Apache Nutch 1.16 Fetcher class implementation here -
> > https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
> > this is a map-only job. I don't see any reducer set in the job, so my
> > question is: why not set job.setNumReduceTasks(0) and save time by
> > outputting directly to HDFS?
> >
> > Regards
> > Prateek
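
(A minimal sketch, assuming the stock Nutch 1.x Fetcher tool and
NutchConfiguration, of how the two properties suggested above could be set
programmatically before launching the fetch job; the segment path and thread
count are placeholders, not values from this thread.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.util.NutchConfiguration;

public class FetchWithInlineParse {
  public static void main(String[] args) throws Exception {
    // Loads the usual nutch-default.xml / nutch-site.xml stack.
    Configuration conf = NutchConfiguration.create();

    // Parse inside the fetcher instead of running a separate ParseSegment job.
    conf.setBoolean("fetcher.parse", true);
    // Do not write the raw "content" subdirectory of the segment.
    conf.setBoolean("fetcher.store.content", false);

    // Segment path and thread count are placeholders for illustration only.
    int res = ToolRunner.run(conf, new Fetcher(),
        new String[] { "crawl/segments/20200721000000", "-threads", "10" });
    System.exit(res);
  }
}

(Since the fetcher is launched through ToolRunner, the same properties should
also be passable on the command line via Hadoop's generic options, e.g.
"-Dfetcher.parse=true -Dfetcher.store.content=false" before the segment
argument of "bin/nutch fetch".)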

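(To illustrate the map-only question, a generic Hadoop sketch, not the actual
Nutch fetcher code: with job.setNumReduceTasks(0) the mapper output is written
straight to the output path on HDFS, one part file per map task, with none of
the partitioning or sorting that a reduce phase would provide, which is the
guarantee the segment-reading code still relies on.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MapOnlyExample {

  // Pass-through mapper; a real job would transform the records here.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(key.toString()), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "map-only example");
    job.setJarByClass(MapOnlyExample.class);
    job.setMapperClass(PassThroughMapper.class);
    // No reduce phase: mapper output goes directly to HDFS,
    // one part file per map task, unsorted and unpartitioned.
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // Input and output paths are placeholders.
    FileInputFormat.addInputPath(job, new Path("hdfs:///tmp/input"));
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}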
