Correcting my statement below: I just realized that if I run the Fetch phase with 0 reducers, the output does not contain all the URLs I seeded in the input - only a few of them made it to the final output. So something is not working as expected when using 0 reducers in the Fetch phase.
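One plausible mechanism for the missing URLs is Hadoop's MapFile format, which FetcherOutputFormat writes: a MapFile.Writer requires keys to be appended in sorted order, and without a reduce (shuffle/sort) step the keys arrive in fetch-completion order. The following is a hypothetical, simplified model of that invariant - it is not the actual Nutch or Hadoop code, just an illustration of how unsorted map-only output can lose records:

```java
import java.util.*;

/**
 * Hypothetical, simplified model of the MapFile.Writer sorted-key
 * invariant. A real writer throws on an out-of-order key; this model
 * simply drops the record, to illustrate how a map-only (0-reducer)
 * fetch could end up with fewer URLs than were seeded.
 */
public class SortedAppendModel {

    /** Keeps only the keys that do not violate the ascending-key
     *  invariant, mimicking a writer that cannot accept unsorted input. */
    static List<String> appendAll(List<String> keys) {
        List<String> kept = new ArrayList<>();
        String last = null;
        for (String k : keys) {
            if (last == null || k.compareTo(last) >= 0) {
                kept.add(k);
                last = k;
            }
            // out-of-order key: dropped in this model
        }
        return kept;
    }

    public static void main(String[] args) {
        // With a reduce phase, keys arrive sorted: nothing is lost.
        List<String> sorted = Arrays.asList("a.com", "b.com", "c.com");
        System.out.println(appendAll(sorted).size());   // 3

        // Map-only output: keys arrive in fetch-completion order.
        List<String> unsorted = Arrays.asList("b.com", "a.com", "c.com");
        System.out.println(appendAll(unsorted).size()); // 2 ("a.com" dropped)
    }
}
```

Under this assumption the symptom matches: the more unsorted the map output, the more records vanish.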
Regards
Prateek

On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva <[email protected]> wrote:

> Makes complete sense. Agreed that 0 reducers in the Apache Nutch fetcher
> won't make sense because of the tooling that's built around it.
> Answering your questions - no, we have not made any changes to
> FetcherOutputFormat. In fact, the whole fetcher and parse job is the same
> as in Apache Nutch 1.16 (Fetcher.java and ParseSegment.java). We have
> built wrappers around these classes to run them using Azkaban
> (https://azkaban.github.io/). And it still works if I assign 0 reducers
> in the Fetch phase.
>
> Final clarification - if I set fetcher.store.content=true and
> fetcher.parse=true, do I no longer need the parse job in my workflow,
> because parsing is done as part of the fetch flow?
> Also, I agree with your point that if I modify FetcherOutputFormat to
> include the Avro conversion step, I might get rid of that as well. This
> will save some time, since the fetcher would directly create the final
> Avro format that I need. So the only remaining question is: if I set
> fetcher.parse=true, can I get rid of the parse job as a separate step
> completely?
>
> Regards
> Prateek
>
> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
> <[email protected]> wrote:
>
>> Hi Prateek,
>>
>> (regarding 1.)
>>
>> It's also possible to combine fetcher.store.content=true and
>> fetcher.parse=true.
>> You might save some time unless the fetch job is CPU-bound - it usually
>> is limited by network and RAM for buffering content.
>>
>> > which code are you referring to?
>>
>> Maybe it isn't "a lot". The SegmentReader assumes map files, and there
>> are probably some more tools which also do. If none of this is used in
>> your workflow, that's fine. But if a fetcher without the reduce step
>> should become the default for Nutch, we'd need to take care of all the
>> tools and also ensure backward compatibility.
>>
>> > FYI- I tried running with 0 reducers
>>
>> I assume you've also adapted FetcherOutputFormat?
>>
>> Btw., you could think about inlining the "avroConversion" (or parts of
>> it) into FetcherOutputFormat, which could also remove the need to
>> store the content.
>>
>> Best,
>> Sebastian
>>
>>
>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
>> > Hi Sebastian,
>> >
>> > Thanks for your reply. A couple of questions -
>> >
>> > 1. We have customized the Apache Nutch jobs a bit: we have a separate
>> > parse job (ParseSegment.java) after the fetch job (Fetcher.java). So,
>> > as suggested above, if I use fetcher.store.content=false, I assume
>> > the "content" folder will not be created and hence our parse job
>> > won't work, because it takes the content folder as its input. Also,
>> > we have added an additional step, "avroConversion", which takes
>> > "parse_data", "parse_text", "content" and "crawl_fetch" as input and
>> > converts them into a specific Avro schema defined by us. So I think
>> > I would end up breaking a lot of things if I set
>> > fetcher.store.content=false and do parsing in the fetch phase only
>> > (fetcher.parse=true).
>> >
>> > image.png
>> >
>> > 2. In your earlier email, you said "a lot of code accessing the
>> > segments still assumes map files" - which code are you referring to?
>> > In my use case above, we are not sending the crawled output to any
>> > indexers. In the Avro conversion step, we just convert the data into
>> > an Avro schema and dump it to HDFS. Do you think we still need
>> > reducers in the fetch phase? FYI - I tried running with 0 reducers
>> > and don't see any impact as such.
>> >
>> > Appreciate your help.
>> >
>> > Regards
>> > Prateek
>> >
>> > On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel
>> > <[email protected]> wrote:
>> >
>> > Hi Prateek,
>> >
>> > you're right, there is no specific reducer used, but without a
>> > reduce step the segment data isn't (re)partitioned and isn't sorted.
>> > This was a strong requirement when Nutch was a complete search
>> > engine and the "content" subdir of a segment was used as a page
>> > cache. Getting the content from a segment is fast if the segment is
>> > partitioned in a predictable way (hash partitioning) and map files
>> > are used.
>> >
>> > Well, this isn't a strong requirement anymore, since Nutch uses
>> > Solr, Elasticsearch or other index services. But a lot of code
>> > accessing the segments still assumes map files. Removing the reduce
>> > step from the fetcher would also mean a lot of work in the code and
>> > tools accessing the segments, esp. to ensure backward compatibility.
>> >
>> > Have you tried to run the fetcher with
>> > fetcher.parse=true
>> > fetcher.store.content=false ?
>> > This will save a lot of time, and without the need to write the
>> > large raw content the reduce phase should be fast - only a small
>> > fraction (5-10%) of the fetcher map phase.
>> >
>> > Best,
>> > Sebastian
>> >
>> >
>> > On 7/20/20 11:38 PM, prateek sachdeva wrote:
>> > > Hi Guys,
>> > >
>> > > As per the Apache Nutch 1.16 Fetcher class implementation here -
>> > > https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java ,
>> > > this is a map-only job. I don't see any reducer set in the job. So
>> > > my question is: why not set job.setNumReduceTasks(0) and save time
>> > > by outputting directly to HDFS?
>> > >
>> > > Regards
>> > > Prateek
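Sebastian's point about segments being "partitioned in a predictable way (hash partitioning)" can be sketched: Hadoop's default HashPartitioner routes a key to partition (key.hashCode() & Integer.MAX_VALUE) % numPartitions, so with a reduce step in place a reader can compute which part-file of a segment holds a given URL instead of scanning all of them. The partitioning arithmetic below mirrors Hadoop's HashPartitioner.getPartition(); everything else is an illustrative stand-in, not Nutch code:

```java
import java.util.*;

/**
 * Sketch of why a reduce step makes segment lookups predictable.
 * partitionFor() uses the same arithmetic as Hadoop's default
 * HashPartitioner.getPartition(); the in-memory "part files" are
 * a stand-in for the real segment subdirectories.
 */
public class SegmentLookupSketch {

    /** Same arithmetic as Hadoop's HashPartitioner.getPartition(). */
    static int partitionFor(String url, int numPartitions) {
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;

        // "Write" side: each URL lands in the part the partitioner picks.
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        List<String> urls = Arrays.asList(
            "https://example.com/a", "https://example.org/b",
            "https://example.net/c", "https://example.edu/d");
        for (String u : urls) parts.get(partitionFor(u, numPartitions)).add(u);

        // "Read" side: open only the one predicted part, not all four.
        for (String u : urls) {
            int p = partitionFor(u, numPartitions);
            System.out.println(u + " -> part-" + p + ", found: "
                + parts.get(p).contains(u));
        }
    }
}
```

With 0 reducers this guarantee disappears: map-only output is partitioned by input split, not by key, so a URL could be in any part-file.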

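For reference, the two settings discussed in the thread would, in a standard Nutch 1.x setup, go into conf/nutch-site.xml. A minimal sketch - the property names (fetcher.parse, fetcher.store.content) are taken from the thread itself; whether fetcher.store.content can be false depends on whether anything downstream (like the separate parse job or the avroConversion step) still reads the "content" subdir:

```xml
<configuration>
  <!-- Parse during the fetch phase instead of running
       ParseSegment.java as a separate job. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
  <!-- Skip writing the raw content to the segment's "content" subdir.
       Only safe once no downstream step reads that subdir. -->
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
  </property>
</configuration>
```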
