Hi Prateek, (regarding 1.)

It's also possible to combine fetcher.store.content=true and
fetcher.parse=true. That way the "content" directory is still written for
your avroConversion step, but the separate parse job is no longer needed.
You might save some time unless the fetch job is CPU-bound - usually it is
limited by the network and by the RAM needed for buffering content.
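E.g. in nutch-site.xml (these are the standard property names, everything
else in your configuration stays untouched):

  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
  <property>
    <name>fetcher.store.content</name>
    <value>true</value>
  </property>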
> which code are you referring to?

Maybe it isn't "a lot". The SegmentReader assumes map files, and there are
probably a few more tools which do as well. If none of them is used in your
workflow, that's fine. But if a fetcher without the reduce step should
become the default for Nutch, we'd need to take care of all tools and also
ensure backward compatibility.

> FYI- I tried running with 0 reducers

I assume you've also adapted FetcherOutputFormat? It writes the segment
data as map files, and a map file requires its keys to be appended in
sorted order - without the sorting done by the reduce phase I'd expect the
writers to fail with "key out of order" errors.
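A map-only setup would roughly look like this in the job configuration (an
untested sketch - getConf() and segment are stand-ins for what
Fetcher.fetch() has at hand, and SequenceFileOutputFormat is just one
possible replacement for FetcherOutputFormat):

  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

  // Sketch: a map-only fetcher job. With zero reducers the map output is
  // written directly to HDFS - no shuffle, no sort, no repartitioning.
  Job job = Job.getInstance(getConf(), "fetch " + segment);
  job.setNumReduceTasks(0);
  // FetcherOutputFormat builds map files and relies on the reduce phase
  // to deliver the keys in sorted order, so it has to be swapped out:
  job.setOutputFormatClass(SequenceFileOutputFormat.class);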
Btw., you could think about inlining the "avroConversion" step (or parts of
it) into FetcherOutputFormat, which could also remove the need to store the
content at all.
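Very roughly, something along these lines (an untested sketch - the class
name, the "my.avro.schema" property and the "url" field are placeholders,
and the body of write() is where your avroConversion logic would go):

  import java.io.IOException;

  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.RecordWriter;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.nutch.crawl.NutchWritable;

  // Placeholder name: converts the fetcher output to your Avro schema on
  // the fly instead of materializing the segment subdirectories first.
  public class AvroFetcherOutputFormat
      extends FileOutputFormat<Text, NutchWritable> {

    @Override
    public RecordWriter<Text, NutchWritable> getRecordWriter(
        TaskAttemptContext context) throws IOException {
      // "my.avro.schema" is a made-up property holding the schema JSON
      Schema schema = new Schema.Parser()
          .parse(context.getConfiguration().get("my.avro.schema"));
      Path file = getDefaultWorkFile(context, ".avro");
      DataFileWriter<GenericRecord> out = new DataFileWriter<>(
          new GenericDatumWriter<GenericRecord>(schema));
      out.create(schema,
          file.getFileSystem(context.getConfiguration()).create(file));

      return new RecordWriter<Text, NutchWritable>() {
        @Override
        public void write(Text url, NutchWritable value) throws IOException {
          // here goes your "avroConversion": unwrap CrawlDatum / Content /
          // ParseData / ParseText from the NutchWritable, fill the record
          GenericRecord record = new GenericData.Record(schema);
          record.put("url", url.toString()); // "url" is a placeholder field
          out.append(record);
        }

        @Override
        public void close(TaskAttemptContext ctx) throws IOException {
          out.close();
        }
      };
    }
  }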
Best,
Sebastian

On 7/21/20 11:28 AM, prateek sachdeva wrote:
> Hi Sebastian,
>
> Thanks for your reply. A couple of questions -
>
> 1. We have customized the Apache Nutch jobs a bit, like this: we have a
> separate parse job (ParseSegment.java) after the fetch job (Fetcher.java).
> So, as suggested above, if I use fetcher.store.content=false, I am
> assuming the "content" folder will not be created and hence our parse job
> won't work, because it takes the content folder as its input. Also, we
> have added an additional step "avroConversion" which takes "parse_data",
> "parse_text", "content" and "crawl_fetch" as input and converts them into
> a specific avro schema defined by us. So I think I will end up breaking a
> lot of things if I add fetcher.store.content=false and do parsing in the
> fetch phase only (fetcher.parse=true).
>
> image.png
>
> 2. In your earlier email, you said "a lot of code accessing the segments
> still assumes map files" - which code are you referring to? In my use
> case above, we are not sending the crawled output to any indexers. In the
> avro conversion step, we just convert the data into an avro schema and
> dump it to HDFS. Do you think we still need reducers in the fetch phase?
> FYI - I tried running with 0 reducers and don't see any impact as such.
>
> Appreciate your help.
>
> Regards
> Prateek
>
> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel
> <[email protected]> wrote:
>
> Hi Prateek,
>
> you're right, there is no specific reducer used, but without a reduce
> step the segment data isn't (re)partitioned and isn't sorted. This was a
> strong requirement when Nutch was a complete search engine and the
> "content" subdir of a segment was used as a page cache. Getting the
> content from a segment is fast if the segment is partitioned in a
> predictable way (hash partitioning) and map files are used.
>
> Well, this isn't a strong requirement anymore, since Nutch uses Solr,
> Elasticsearch or other index services. But a lot of code accessing the
> segments still assumes map files. Removing the reduce step from the
> fetcher would also mean a lot of work in code and tools accessing the
> segments, esp. to ensure backward compatibility.
>
> Have you tried to run the fetcher with
>   fetcher.parse=true
>   fetcher.store.content=false ?
> This will save a lot of time, and without the need to write the large raw
> content the reduce phase should be fast - only a small fraction (5-10%)
> of the fetcher map phase.
>
> Best,
> Sebastian
>
>
> On 7/20/20 11:38 PM, prateek sachdeva wrote:
> > Hi Guys,
> >
> > As per the Apache Nutch 1.16 Fetcher class implementation here -
> > https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
> > this is a map-only job. I don't see any reducer set in the job. So my
> > question is: why not set job.setNumReduceTasks(0) and save the time by
> > outputting directly to HDFS?
> >
> > Regards
> > Prateek