> might have to create my own custom FetcherOutputFormat to allow out of
> order writes. I will check how I can do that.

Just replace the MapFile.Writer by a SequenceFile.Writer.
Eventually, this may require further changes.
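Roughly along these lines - an untested sketch rather than a drop-in patch, and the class and method names below are made up; the real FetcherOutputFormat also creates writers for the content and parse outputs, which would need the same treatment:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;

  public class UnsortedFetchOutput {

    // Opens a SequenceFile.Writer for the crawl_fetch output. Unlike
    // MapFile.Writer, SequenceFile.Writer does not require keys to be
    // appended in sorted order, so a map-only fetcher can write directly.
    public static SequenceFile.Writer openFetchWriter(Configuration conf,
        Path segmentOut, String partName) throws IOException {
      Path fetch = new Path(
          new Path(segmentOut, CrawlDatum.FETCH_DIR_NAME), partName);
      return SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(fetch),
          SequenceFile.Writer.keyClass(Text.class),
          SequenceFile.Writer.valueClass(CrawlDatum.class),
          SequenceFile.Writer.compression(CompressionType.RECORD));
    }
  }

A sequence file written this way can still be read sequentially or fed into downstream jobs; what you lose is random access by key, which is what the map-file layout provided.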
> I have also concluded this discussion here -
> https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.

Thanks for updating the discussion there!

On 7/22/20 4:09 PM, prateek sachdeva wrote:
> Thanks a lot Sebastian. Yes, after checking the logs I saw a "key out of
> order" exception and realized that MapFile expects entries to be in
> order, and MapFile is used in FetcherOutputFormat while writing data to
> HDFS. I might have to create my own custom FetcherOutputFormat to allow
> out of order writes. I will check how I can do that.
>
> I will also try to merge parsing and avro conversion into the fetch job
> directly to see if there are any improvements.
>
> I have also concluded this discussion here -
> https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.
> So if you want to add something there, please feel free to do so.
>
> Regards
> Prateek
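For reference, the "key out of order" exception Prateek mentions is easy to reproduce in isolation; a minimal standalone demo (path and values invented for illustration):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class MapFileOrderDemo {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      MapFile.Writer writer = new MapFile.Writer(conf,
          new Path("/tmp/demo-mapfile"),
          MapFile.Writer.keyClass(Text.class),
          SequenceFile.Writer.valueClass(Text.class));
      writer.append(new Text("b"), new Text("first"));  // fine
      // The next append fails with "java.io.IOException: key out of order",
      // the same error the fetcher hits when its output arrives unsorted.
      writer.append(new Text("a"), new Text("second"));
      writer.close();
    }
  }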
> On Tue, Jul 21, 2020 at 7:50 PM Sebastian Nagel
> <[email protected]> wrote:
>
>> Hi Prateek,
>>
>>> if I do 0 reducers in the Fetch phase, I am not getting all the URLs
>>> in output that I seeded in input. Looks like only a few of them made
>>> it to the final output.
>>
>> There should be error messages in the task logs caused by output not
>> sorted by URL (used as key in map files).
>>
>>>> Final clarification - If I do fetcher.store.content=true and
>>>> fetcher.parse=true, I don't need that Parse Job in my workflow and
>>>> parsing will be done as part of the fetcher flow only?
>>
>> Yes, parsing is then done in the fetcher and the parse output is
>> written to crawl_parse, parse_text and parse_data.
>>
>> Best,
>> Sebastian
>>
>> On 7/21/20 3:42 PM, prateek sachdeva wrote:
>>> Correcting my statement below. I just realized that if I do 0
>>> reducers in the Fetch phase, I am not getting all the URLs in output
>>> that I seeded in input. Looks like only a few of them made it to the
>>> final output. So something is not working as expected if we use 0
>>> reducers in the Fetch phase.
>>>
>>> Regards
>>> Prateek
>>>
>>> On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva
>>> <[email protected]> wrote:
>>>
>>>> Makes complete sense. Agreed that 0 reducers in the Apache Nutch
>>>> fetcher won't make sense because of the tooling that's built around
>>>> it. Answering your questions - no, we have not made any changes to
>>>> FetcherOutputFormat. In fact, the whole fetcher and parse job is the
>>>> same as that of Apache Nutch 1.16 (Fetcher.java and
>>>> ParseSegment.java). We have built wrappers around these classes to
>>>> run using Azkaban (https://azkaban.github.io/). And still it works
>>>> if I assign 0 reducers in the Fetch phase.
>>>>
>>>> Final clarification - If I do fetcher.store.content=true and
>>>> fetcher.parse=true, I don't need that Parse Job in my workflow and
>>>> parsing will be done as part of the fetcher flow only?
>>>> Also, I agree with your point that if I modify FetcherOutputFormat
>>>> to include the avro conversion step, I might get rid of that as
>>>> well. This will save some time for sure, since the Fetcher will
>>>> directly create the final avro format that I need. So the only
>>>> question that remains is whether, with fetcher.parse=true, I can get
>>>> rid of the parse job as a separate step completely.
>>>>
>>>> Regards
>>>> Prateek
>>>>
>>>> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Prateek,
>>>>>
>>>>> (regarding 1.)
>>>>> It's also possible to combine fetcher.store.content=true and
>>>>> fetcher.parse=true.
>>>>> You might save some time unless the fetch job is CPU-bound - it
>>>>> usually is limited by network and RAM for buffering content.
>>>>>
>>>>>> which code are you referring to?
>>>>>
>>>>> Maybe it isn't "a lot". The SegmentReader assumes map files, and
>>>>> there are probably more tools which do as well. If none of that is
>>>>> used in your workflow, that's fine. But if a fetcher without the
>>>>> reduce step should become the default for Nutch, we'd need to take
>>>>> care of all tools and also ensure backward compatibility.
>>>>>
>>>>>> FYI - I tried running with 0 reducers
>>>>>
>>>>> I assume you've also adapted FetcherOutputFormat?
>>>>>
>>>>> Btw., you could think about inlining the "avroConversion" (or parts
>>>>> of it) into FetcherOutputFormat, which could also remove the need
>>>>> to store the content.
>>>>>
>>>>> Best,
>>>>> Sebastian
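To illustrate the idea of inlining the avro conversion into FetcherOutputFormat: a rough sketch of the writer half only, with an invented three-field schema standing in for whatever the "avroConversion" step actually produces. Inside a RecordWriter, the OutputStream would come from FileSystem.create() on the segment output path:

  import java.io.IOException;
  import java.io.OutputStream;

  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;

  public class AvroSegmentWriter {

    // Placeholder schema - the real one is whatever the avroConversion
    // step defines.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Page\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"},"
        + "{\"name\":\"status\",\"type\":\"int\"},"
        + "{\"name\":\"text\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    private final DataFileWriter<GenericRecord> writer;

    public AvroSegmentWriter(OutputStream out) throws IOException {
      writer = new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA));
      writer.create(SCHEMA, out);
    }

    // Called once per fetched page instead of a separate conversion job.
    public void write(String url, int status, String text) throws IOException {
      GenericRecord rec = new GenericData.Record(SCHEMA);
      rec.put("url", url);
      rec.put("status", status);
      rec.put("text", text);
      writer.append(rec);
    }

    public void close() throws IOException {
      writer.close();
    }
  }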
>>>>>
>>>>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
>>>>>> Hi Sebastian,
>>>>>>
>>>>>> Thanks for your reply. A couple of questions -
>>>>>>
>>>>>> 1. We have customized the Apache Nutch jobs a bit, like this: we
>>>>>> have a separate parse job (ParseSegment.java) after the fetch job
>>>>>> (Fetcher.java). So, as suggested above, if I use
>>>>>> fetcher.store.content=false, I am assuming the "content" folder
>>>>>> will not be created and hence our parse job won't work, because it
>>>>>> takes the content folder as input. Also, we have added an
>>>>>> additional step "avroConversion" which takes "parse_data",
>>>>>> "parse_text", "content" and "crawl_fetch" as input and converts
>>>>>> them into a specific avro schema defined by us. So I think I will
>>>>>> end up breaking a lot of things if I add
>>>>>> fetcher.store.content=false and do parsing in the fetch phase only
>>>>>> (fetcher.parse=true).
>>>>>>
>>>>>> [attached image: image.png]
>>>>>>
>>>>>> 2. In your earlier email you said "a lot of code accessing the
>>>>>> segments still assumes map files" - which code are you referring
>>>>>> to? In my use case above, we are not sending the crawled output to
>>>>>> any indexers. In the avro conversion step, we just convert the
>>>>>> data into an avro schema and dump it to HDFS. Do you think we
>>>>>> still need reducers in the fetch phase? FYI - I tried running with
>>>>>> 0 reducers and don't see any impact as such.
>>>>>>
>>>>>> Appreciate your help.
>>>>>>
>>>>>> Regards
>>>>>> Prateek
>>>>>>
>>>>>> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Prateek,
>>>>>>>
>>>>>>> you're right, there is no specific reducer used, but without a
>>>>>>> reduce step the segment data isn't (re)partitioned and isn't
>>>>>>> sorted. This was a strong requirement back when Nutch was a
>>>>>>> complete search engine and the "content" subdir of a segment was
>>>>>>> used as page cache. Getting the content from a segment is fast if
>>>>>>> the segment is partitioned in a predictable way (hash
>>>>>>> partitioning) and map files are used.
>>>>>>>
>>>>>>> Well, this isn't a strong requirement anymore, since Nutch uses
>>>>>>> Solr, Elasticsearch or other index services. But a lot of code
>>>>>>> accessing the segments still assumes map files. Removing the
>>>>>>> reduce step from the fetcher would also mean a lot of work in
>>>>>>> code and tools accessing the segments, esp. to ensure backward
>>>>>>> compatibility.
>>>>>>>
>>>>>>> Have you tried to run the fetcher with
>>>>>>>   fetcher.parse=true
>>>>>>>   fetcher.store.content=false ?
>>>>>>> This will save a lot of time, and without the need to write the
>>>>>>> large raw content, the reduce phase should be fast - only a small
>>>>>>> fraction (5-10%) of the fetcher map phase.
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>> On 7/20/20 11:38 PM, prateek sachdeva wrote:
>>>>>>>> Hi Guys,
>>>>>>>>
>>>>>>>> As per the Apache Nutch 1.16 Fetcher class implementation here -
>>>>>>>> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
>>>>>>>> this is a map-only job. I don't see any reducer set in the job.
>>>>>>>> So my question is: why not set job.setNumReduceTasks(0) and save
>>>>>>>> time by outputting directly to HDFS?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Prateek
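To make the original question concrete: with the new MapReduce API, disabling the reduce step is a one-liner in the job setup. A hypothetical sketch, not how Nutch 1.16 actually configures the fetcher (which keeps the reduce step precisely so that keys reach the MapFile writers in sorted order):

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class MapOnlyFetchSketch {
    public static Job createJob(Configuration conf) throws IOException {
      Job job = Job.getInstance(conf, "fetch (map-only sketch)");
      // With zero reducers the shuffle/sort phase is skipped entirely and
      // each mapper writes directly through the OutputFormat to HDFS. That
      // is also why the MapFile output then fails: keys arrive unsorted.
      job.setNumReduceTasks(0);
      return job;
    }
  }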

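And for anyone who wants to try the combination Sebastian suggests above, the two properties can be set in conf/nutch-site.xml. The property names come straight from the thread; the description texts are mine:

  <configuration>
    <property>
      <name>fetcher.parse</name>
      <value>true</value>
      <description>Parse during fetching instead of running a separate
      ParseSegment job.</description>
    </property>
    <property>
      <name>fetcher.store.content</name>
      <value>false</value>
      <description>Do not store the raw content in the segment's
      content/ subdir.</description>
    </property>
  </configuration>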
