Thanks a lot Sebastian. Yes, after checking the logs I saw a "key out of order" exception and realized that MapFile expects entries to be written in sorted key order, and that MapFile is what FetcherOutputFormat uses when writing data to HDFS. I might have to create my own custom FetcherOutputFormat to allow out-of-order writes. I will check how I can do that.
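[For anyone following along, here is a minimal sketch of what such a custom output format might look like. It is only a sketch, not Nutch's actual FetcherOutputFormat: it assumes the new MapReduce API (used by Nutch since 1.15), it writes everything into a single file per task instead of demultiplexing into the crawl_fetch/content/parse_* subdirectories, and the class name UnsortedFetcherOutputFormat is made up. The point is simply that SequenceFile.Writer, unlike MapFile.Writer, does not enforce sorted keys, so it tolerates map-only (unsorted) output:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.nutch.crawl.NutchWritable;

    public class UnsortedFetcherOutputFormat
        extends FileOutputFormat<Text, NutchWritable> {

      @Override
      public RecordWriter<Text, NutchWritable> getRecordWriter(
          TaskAttemptContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        Path dir = FileOutputFormat.getOutputPath(context);
        Path file = new Path(dir,
            "part-" + context.getTaskAttemptID().getTaskID().getId());
        // SequenceFile.Writer appends records in arrival order and does
        // not check key ordering, unlike MapFile.Writer.
        final SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(file),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(NutchWritable.class));
        return new RecordWriter<Text, NutchWritable>() {
          @Override
          public void write(Text key, NutchWritable value)
              throws IOException {
            writer.append(key, value);
          }
          @Override
          public void close(TaskAttemptContext ctx) throws IOException {
            writer.close();
          }
        };
      }
    }

Wiring it in would be a matter of job.setOutputFormatClass(UnsortedFetcherOutputFormat.class) in the fetch job setup - but note that downstream tools such as SegmentReader expect the MapFile layout, as Sebastian points out below.]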
I will also try merging parsing and the avro conversion into the fetch job directly, to see if there are some improvements. I have also posted a summary of this discussion here -
https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.
So if you want to add something there, please feel free to do so.

Regards
Prateek

On Tue, Jul 21, 2020 at 7:50 PM Sebastian Nagel <[email protected]> wrote:

> Hi Prateek,
>
> > if I do 0 reducers in the Fetch phase, I am not getting all the urls
> > in output that I seeded in input. Looks like only a few of them made
> > it to the final output.
>
> There should be error messages in the task logs caused by output not
> sorted by URL (used as key in map files).
>
> >> Final clarification - If I do fetcher.store.content=true and
> >> fetcher.parse=true, I don't need that Parse Job in my workflow and
> >> parsing will be done as part of the fetcher flow only?
>
> Yes, parsing is then done in the fetcher and the parse output is
> written to crawl_parse, parse_text and parse_data.
>
> Best,
> Sebastian
>
> On 7/21/20 3:42 PM, prateek sachdeva wrote:
> > Correcting my statement below. I just realized that if I do 0
> > reducers in the Fetch phase, I am not getting all the urls in output
> > that I seeded in input. Looks like only a few of them made it to the
> > final output. So something is not working as expected if we use 0
> > reducers in the Fetch phase.
> >
> > Regards
> > Prateek
> >
> > On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva
> > <[email protected]> wrote:
> >
> >> Makes complete sense. Agreed that 0 reducers in the apache nutch
> >> fetcher won't make sense because of the tooling that's built around
> >> it. Answering your questions - No, we have not made any changes to
> >> FetcherOutputFormat. In fact, the whole fetcher and parse job is the
> >> same as in apache nutch 1.16 (Fetcher.java and ParseSegment.java).
> >> We have built wrappers around these classes to run them using
> >> Azkaban (https://azkaban.github.io/). And still it works if I assign
> >> 0 reducers in the Fetch phase.
> >>
> >> Final clarification - If I do fetcher.store.content=true and
> >> fetcher.parse=true, I don't need that Parse Job in my workflow and
> >> parsing will be done as part of the fetcher flow only?
> >> Also, I agree with your point that if I modify FetcherOutputFormat
> >> to include the avro conversion step, I might get rid of that as
> >> well. This will save some time for sure, since the Fetcher will then
> >> directly create the final avro format that I need. So the only
> >> question that remains is whether, with fetcher.parse=true, I can get
> >> rid of the parse job as a separate step completely.
> >>
> >> Regards
> >> Prateek
> >>
> >> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
> >> <[email protected]> wrote:
> >>
> >>> Hi Prateek,
> >>>
> >>> (regarding 1.)
> >>>
> >>> It's also possible to combine fetcher.store.content=true and
> >>> fetcher.parse=true. You might save some time unless the fetch job
> >>> is CPU-bound - it usually is limited by network and by RAM for
> >>> buffering content.
> >>>
> >>>> which code are you referring to?
> >>>
> >>> Maybe it isn't "a lot". The SegmentReader assumes map files, and
> >>> there are probably some more tools which do as well. If nothing of
> >>> that is used in your workflow, that's fine. But if a fetcher
> >>> without the reduce step should become the default for Nutch, we'd
> >>> need to take care of all tools and also ensure backward
> >>> compatibility.
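[For reference, the combination discussed here is controlled by two properties that can be set in conf/nutch-site.xml; the names and defaults are listed in nutch-default.xml. A minimal snippet:

    <property>
      <name>fetcher.parse</name>
      <value>true</value>
    </property>
    <property>
      <name>fetcher.store.content</name>
      <value>true</value>
    </property>

With fetcher.parse=true the fetcher writes crawl_parse, parse_text and parse_data itself, so the separate ParseSegment step becomes redundant; setting fetcher.store.content to false would additionally skip writing the raw content, at the cost of the content/ subdirectory that the avro conversion step reads.]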
> >>>
> >>>> FYI- I tried running with 0 reducers
> >>>
> >>> I assume you've also adapted FetcherOutputFormat?
> >>>
> >>> Btw., you could think about inlining the "avroConversion" (or parts
> >>> of it) into FetcherOutputFormat, which could also remove the need
> >>> to store the content.
> >>>
> >>> Best,
> >>> Sebastian
> >>>
> >>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
> >>>> Hi Sebastian,
> >>>>
> >>>> Thanks for your reply. A couple of questions -
> >>>>
> >>>> 1. We have customized the apache nutch jobs a bit: we run a
> >>>> separate parse job (ParseSegment.java) after the fetch job
> >>>> (Fetcher.java). So, as suggested above, if I use
> >>>> fetcher.store.content=false, I am assuming the "content" folder
> >>>> will not be created and hence our parse job won't work, because it
> >>>> takes the content folder as its input. Also, we have added an
> >>>> additional step "avroConversion" which takes "parse_data",
> >>>> "parse_text", "content" and "crawl_fetch" as input and converts
> >>>> them into a specific avro schema defined by us. So I think I will
> >>>> end up breaking a lot of things if I set
> >>>> fetcher.store.content=false and do parsing in the fetch phase only
> >>>> (fetcher.parse=true).
> >>>>
> >>>> 2. In your earlier email, you said "a lot of code accessing the
> >>>> segments still assumes map files" - which code are you referring
> >>>> to? In my use case above, we are not sending the crawled output to
> >>>> any indexers. In the avro conversion step, we just convert the
> >>>> data into the avro schema and dump it to HDFS. Do you think we
> >>>> still need reducers in the fetch phase? FYI - I tried running with
> >>>> 0 reducers and don't see any impact as such.
> >>>>
> >>>> Appreciate your help.
> >>>>
> >>>> Regards
> >>>> Prateek
> >>>>
> >>>> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel
> >>>> <[email protected]> wrote:
> >>>>
> >>>> Hi Prateek,
> >>>>
> >>>> you're right, there is no specific reducer used, but without a
> >>>> reduce step the segment data isn't (re)partitioned and the data
> >>>> isn't sorted. This was a strong requirement back when Nutch was a
> >>>> complete search engine and the "content" subdir of a segment was
> >>>> used as a page cache. Getting the content from a segment is fast
> >>>> if the segment is partitioned in a predictable way (hash
> >>>> partitioning) and map files are used.
> >>>>
> >>>> Well, this isn't a strong requirement anymore, since Nutch uses
> >>>> Solr, Elasticsearch or other index services. But a lot of code
> >>>> accessing the segments still assumes map files. Removing the
> >>>> reduce step from the fetcher would also mean a lot of work in code
> >>>> and tools accessing the segments, esp. to ensure backward
> >>>> compatibility.
> >>>>
> >>>> Have you tried to run the fetcher with
> >>>> fetcher.parse=true
> >>>> fetcher.store.content=false ?
> >>>> This will save a lot of time, and without the need to write the
> >>>> large raw content the reduce phase should be fast - only a small
> >>>> fraction (5-10%) of the fetcher map phase.
> >>>>
> >>>> Best,
> >>>> Sebastian
> >>>>
> >>>> On 7/20/20 11:38 PM, prateek sachdeva wrote:
> >>>> > Hi Guys,
> >>>> >
> >>>> > As per the Apache Nutch 1.16 Fetcher class implementation here -
> >>>> > https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
> >>>> > this is a map-only job. I don't see any reducer set in the Job.
> >>>> > So my question is: why not set job.setNumReduceTasks(0) and
> >>>> > save the time by outputting directly to HDFS?
> >>>> >
> >>>> > Regards
> >>>> > Prateek
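[To make the original question concrete, the map-only variant asked about would be a one-line change in the job setup. A sketch against the new MapReduce API, with the class and job name made up; as the thread establishes, this only works if the output format tolerates unsorted keys, because dropping the reduce step also drops the shuffle/sort:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MapOnlyFetchJob {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "fetch (map-only sketch)");
        // With 0 reduce tasks, map output bypasses the shuffle/sort and
        // goes straight to the OutputFormat, so keys arrive unsorted.
        // The default MapFile-based FetcherOutputFormat then fails with
        // "key out of order" - hence the SequenceFile-based sketch near
        // the top of this thread.
        job.setNumReduceTasks(0);
      }
    }
]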

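[And a rough idea of Sebastian's suggestion to inline the avro conversion into the output format: inside getRecordWriter() one could open an Avro container file next to (or instead of) the SequenceFile writer and populate records directly in write(). Everything here is illustrative - the class name, the "url" and "parseText" field names, and the schema all stand in for whatever the custom avro schema actually defines:

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AvroSegmentWriter {
      private final Schema schema;
      private final DataFileWriter<GenericRecord> out;

      public AvroSegmentWriter(Configuration conf, Path file,
          String schemaJson) throws IOException {
        this.schema = new Schema.Parser().parse(schemaJson);
        this.out = new DataFileWriter<>(
            new GenericDatumWriter<GenericRecord>(schema));
        // DataFileWriter produces a self-describing Avro container file;
        // records are appended in arrival order, so no sorting (and thus
        // no reduce step) is required.
        this.out.create(schema, FileSystem.get(conf).create(file));
      }

      // Would be called from RecordWriter.write(); "url" and "parseText"
      // are placeholder field names for the custom schema.
      public void append(String url, String parseText) throws IOException {
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("url", url);
        rec.put("parseText", parseText);
        out.append(rec);
      }

      public void close() throws IOException {
        out.close();
      }
    }

Writing the avro directly from the fetcher this way would remove both the separate avroConversion step and, if the schema carries everything needed, the need to store the raw content at all.]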
