Correcting my statement below: I just realized that if I run the Fetch phase with 0 reducers, the output does not contain all the URLs I seeded in the input - only a few of them made it to the final output. So something is not working as expected when using 0 reducers in the Fetch phase.
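One plausible mechanism for the missing URLs is Hadoop's MapFile format, which FetcherOutputFormat writes: a MapFile.Writer requires keys to be appended in sorted order, and without a reduce (shuffle/sort) step the keys arrive in fetch-completion order. The following is a hypothetical, simplified model of that invariant - it is not the actual Nutch or Hadoop code, just an illustration of how unsorted map-only output can lose records:

```java
import java.util.*;

/**
 * Hypothetical, simplified model of the MapFile.Writer sorted-key
 * invariant. A real writer throws on an out-of-order key; this model
 * simply drops the record, to illustrate how a map-only (0-reducer)
 * fetch could end up with fewer URLs than were seeded.
 */
public class SortedAppendModel {

    /** Keeps only the keys that do not violate the ascending-key
     *  invariant, mimicking a writer that cannot accept unsorted input. */
    static List<String> appendAll(List<String> keys) {
        List<String> kept = new ArrayList<>();
        String last = null;
        for (String k : keys) {
            if (last == null || k.compareTo(last) >= 0) {
                kept.add(k);
                last = k;
            }
            // out-of-order key: dropped in this model
        }
        return kept;
    }

    public static void main(String[] args) {
        // With a reduce phase, keys arrive sorted: nothing is lost.
        List<String> sorted = Arrays.asList("a.com", "b.com", "c.com");
        System.out.println(appendAll(sorted).size());   // 3

        // Map-only output: keys arrive in fetch-completion order.
        List<String> unsorted = Arrays.asList("b.com", "a.com", "c.com");
        System.out.println(appendAll(unsorted).size()); // 2 ("a.com" dropped)
    }
}
```

Under this assumption the symptom matches: the more unsorted the map output, the more records vanish.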
Regards
Prateek

On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva <[email protected]> wrote:

> Makes complete sense. Agreed that 0 reducers in the Apache Nutch fetcher
> won't make sense because of the tooling that's built around it.
> Answering your questions - no, we have not made any changes to
> FetcherOutputFormat. In fact, the whole fetcher and parse job is the same
> as in Apache Nutch 1.16 (Fetcher.java and ParseSegment.java). We have
> built wrappers around these classes to run them using Azkaban
> (https://azkaban.github.io/). And it still works if I assign 0 reducers
> in the Fetch phase.
>
> Final clarification - if I set fetcher.store.content=true and
> fetcher.parse=true, do I no longer need the parse job in my workflow,
> because parsing is done as part of the fetch flow?
> Also, I agree with your point that if I modify FetcherOutputFormat to
> include the Avro conversion step, I might get rid of that as well. This
> will save some time, since the fetcher would directly create the final
> Avro format that I need. So the only remaining question is: if I set
> fetcher.parse=true, can I get rid of the parse job as a separate step
> completely?
>
> Regards
> Prateek
>
> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
> <[email protected]> wrote:
>
>> Hi Prateek,
>>
>> (regarding 1.)
>>
>> It's also possible to combine fetcher.store.content=true and
>> fetcher.parse=true.
>> You might save some time unless the fetch job is CPU-bound - it usually
>> is limited by network and RAM for buffering content.
>>
>> > which code are you referring to?
>>
>> Maybe it isn't "a lot". The SegmentReader assumes map files, and there
>> are probably some more tools which also do. If none of this is used in
>> your workflow, that's fine. But if a fetcher without the reduce step
>> should become the default for Nutch, we'd need to take care of all the
>> tools and also ensure backward compatibility.
>>
>> > FYI- I tried running with 0 reducers
>>
>> I assume you've also adapted FetcherOutputFormat?
>>
>> Btw., you could think about inlining the "avroConversion" (or parts of
>> it) into FetcherOutputFormat, which could also remove the need to
>> store the content.
>>
>> Best,
>> Sebastian
>>
>>
>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
>> > Hi Sebastian,
>> >
>> > Thanks for your reply. A couple of questions -
>> >
>> > 1. We have customized the Apache Nutch jobs a bit: we have a separate
>> > parse job (ParseSegment.java) after the fetch job (Fetcher.java). So,
>> > as suggested above, if I use fetcher.store.content=false, I assume
>> > the "content" folder will not be created and hence our parse job
>> > won't work, because it takes the content folder as its input. Also,
>> > we have added an additional step, "avroConversion", which takes
>> > "parse_data", "parse_text", "content" and "crawl_fetch" as input and
>> > converts them into a specific Avro schema defined by us. So I think
>> > I would end up breaking a lot of things if I set
>> > fetcher.store.content=false and do parsing in the fetch phase only
>> > (fetcher.parse=true).
>> >
>> > image.png
>> >
>> > 2. In your earlier email, you said "a lot of code accessing the
>> > segments still assumes map files" - which code are you referring to?
>> > In my use case above, we are not sending the crawled output to any
>> > indexers. In the Avro conversion step, we just convert the data into
>> > an Avro schema and dump it to HDFS. Do you think we still need
>> > reducers in the fetch phase? FYI - I tried running with 0 reducers
>> > and don't see any impact as such.
>> >
>> > Appreciate your help.
>> >
>> > Regards
>> > Prateek
>> >
>> > On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel
>> > <[email protected]> wrote:
>> >
>> > Hi Prateek,
>> >
>> > you're right, there is no specific reducer used, but without a
>> > reduce step the segment data isn't (re)partitioned and isn't sorted.
>> > This was a strong requirement when Nutch was a complete search
>> > engine and the "content" subdir of a segment was used as a page
>> > cache. Getting the content from a segment is fast if the segment is
>> > partitioned in a predictable way (hash partitioning) and map files
>> > are used.
>> >
>> > Well, this isn't a strong requirement anymore, since Nutch uses
>> > Solr, Elasticsearch or other index services. But a lot of code
>> > accessing the segments still assumes map files. Removing the reduce
>> > step from the fetcher would also mean a lot of work in the code and
>> > tools accessing the segments, esp. to ensure backward compatibility.
>> >
>> > Have you tried to run the fetcher with
>> > fetcher.parse=true
>> > fetcher.store.content=false ?
>> > This will save a lot of time, and without the need to write the
>> > large raw content the reduce phase should be fast - only a small
>> > fraction (5-10%) of the fetcher map phase.
>> >
>> > Best,
>> > Sebastian
>> >
>> >
>> > On 7/20/20 11:38 PM, prateek sachdeva wrote:
>> > > Hi Guys,
>> > >
>> > > As per the Apache Nutch 1.16 Fetcher class implementation here -
>> > > https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java ,
>> > > this is a map-only job. I don't see any reducer set in the job. So
>> > > my question is: why not set job.setNumReduceTasks(0) and save time
>> > > by outputting directly to HDFS?
>> > >
>> > > Regards
>> > > Prateek
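Sebastian's point about segments being "partitioned in a predictable way (hash partitioning)" can be sketched: Hadoop's default HashPartitioner routes a key to partition (key.hashCode() & Integer.MAX_VALUE) % numPartitions, so with a reduce step in place a reader can compute which part-file of a segment holds a given URL instead of scanning all of them. The partitioning arithmetic below mirrors Hadoop's HashPartitioner.getPartition(); everything else is an illustrative stand-in, not Nutch code:

```java
import java.util.*;

/**
 * Sketch of why a reduce step makes segment lookups predictable.
 * partitionFor() uses the same arithmetic as Hadoop's default
 * HashPartitioner.getPartition(); the in-memory "part files" are
 * a stand-in for the real segment subdirectories.
 */
public class SegmentLookupSketch {

    /** Same arithmetic as Hadoop's HashPartitioner.getPartition(). */
    static int partitionFor(String url, int numPartitions) {
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;

        // "Write" side: each URL lands in the part the partitioner picks.
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        List<String> urls = Arrays.asList(
            "https://example.com/a", "https://example.org/b",
            "https://example.net/c", "https://example.edu/d");
        for (String u : urls) parts.get(partitionFor(u, numPartitions)).add(u);

        // "Read" side: open only the one predicted part, not all four.
        for (String u : urls) {
            int p = partitionFor(u, numPartitions);
            System.out.println(u + " -> part-" + p + ", found: "
                + parts.get(p).contains(u));
        }
    }
}
```

With 0 reducers this guarantee disappears: map-only output is partitioned by input split, not by key, so a URL could be in any part-file.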

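For reference, the two settings discussed in the thread would, in a standard Nutch 1.x setup, go into conf/nutch-site.xml. A minimal sketch - the property names (fetcher.parse, fetcher.store.content) are taken from the thread itself; whether fetcher.store.content can be false depends on whether anything downstream (like the separate parse job or the avroConversion step) still reads the "content" subdir:

```xml
<configuration>
  <!-- Parse during the fetch phase instead of running
       ParseSegment.java as a separate job. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
  <!-- Skip writing the raw content to the segment's "content" subdir.
       Only safe once no downstream step reads that subdir. -->
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
  </property>
</configuration>
```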
