Hi Prateek,

> if I do 0 reducers in
> the Fetch phase, I am not getting all the urls in output that I seeded in
> input. Looks like only a few of them made it to the final output.

There should be error messages in the task logs, caused by the output not
being sorted by URL (the URL is used as the key in the map files).
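To illustrate why records get lost: this is not Nutch or Hadoop code, just a
self-contained Java sketch (class and method names are mine) of the invariant
that Hadoop's MapFile.Writer enforces - keys must be appended in sorted
order, which is exactly what the reduce phase guarantees. With 0 reducers the
records arrive in fetch-completion order, appends fail, and URLs drop out of
the segment:

```java
import java.util.Arrays;
import java.util.List;

public class MapFileKeyOrder {

    // Throws, like MapFile.Writer.append() does, when a key arrives
    // out of sorted order relative to the previously written key.
    static void append(List<String> written, String key) {
        if (!written.isEmpty()
                && written.get(written.size() - 1).compareTo(key) > 0) {
            throw new IllegalStateException("key out of order: " + key);
        }
        written.add(key);
    }

    public static void main(String[] args) {
        // Output of a reduce phase is sorted by URL - all appends succeed:
        List<String> sorted = new java.util.ArrayList<>();
        for (String url : Arrays.asList("http://a.example/", "http://b.example/")) {
            append(sorted, url);
        }

        // Map-only output arrives in fetch-completion order and may fail:
        List<String> unsorted = new java.util.ArrayList<>();
        append(unsorted, "http://b.example/");
        try {
            append(unsorted, "http://a.example/");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```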


>> Final clarification - If I do fetcher.store.content=true and
>> fetcher.parse=true, I don't need that Parse Job in my workflow and parsing
>> will be done as part of fetcher flow only?

Yes, parsing is then done in the fetcher and the parse output is written to
crawl_parse, parse_text and parse_data.
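For reference, the two properties go into nutch-site.xml - a minimal sketch
(the description texts below are my paraphrase, not the wording from
nutch-default.xml):

```xml
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>Parse fetched content directly in the fetcher job, so the
  fetcher writes crawl_parse, parse_text and parse_data alongside
  crawl_fetch and no separate parse job is needed.</description>
</property>
<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>Also store the raw content in the segment's "content"
  subdirectory (needed here by the downstream avroConversion step).
  </description>
</property>
```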

Best,
Sebastian

On 7/21/20 3:42 PM, prateek sachdeva wrote:
> Correcting my statement below. I just realized that if I do 0 reducers in
> the Fetch phase, I am not getting all the urls in output that I seeded in
> input. Looks like only a few of them made it to the final output.
> So something is not working as expected if we use 0 reducers in the Fetch
> phase.
> 
> Regards
> Prateek
> 
> On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva <[email protected]>
> wrote:
> 
>> Makes complete sense. Agreed that 0 reducers in the Apache Nutch fetcher
>> won't make sense because of the tooling that's built around it.
>> Answering your questions - no, we have not made any changes to
>> FetcherOutputFormat. In fact, the whole fetcher and parse job is the same
>> as that of Apache Nutch 1.16 (Fetcher.java and ParseSegment.java). We have
>> built wrappers around these classes to run them using Azkaban
>> (https://azkaban.github.io/). And it still works if I assign 0 reducers in
>> the Fetch phase.
>>
>> Final clarification - If I do fetcher.store.content=true and
>> fetcher.parse=true, I don't need that Parse Job in my workflow and parsing
>> will be done as part of fetcher flow only?
>> Also, I agree with your point that if I modify FetcherOutputFormat to
>> include the Avro conversion step, I might get rid of that step as well.
>> This will save some time for sure, since the fetcher will directly create
>> the final Avro format that I need. So the only remaining question is
>> whether, with fetcher.parse=true, I can get rid of the parse job as a
>> separate step completely.
>>
>> Regards
>> Prateek
>>
>> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
>> <[email protected]> wrote:
>>
>>> Hi Prateek,
>>>
>>> (regarding 1.)
>>>
>>> It's also possible to combine fetcher.store.content=true and
>>> fetcher.parse=true.
>>> You might save some time unless the fetch job is CPU-bound - it is
>>> usually limited by network bandwidth and the RAM needed for buffering
>>> content.
>>>
>>>> which code are you referring to?
>>>
>>> Maybe it isn't "a lot". The SegmentReader assumes map files, and there
>>> are probably some more tools which do as well. If none of them is used
>>> in your workflow, that's fine. But if a fetcher without the reduce step
>>> should become the default for Nutch, we'd need to take care of all the
>>> tools and also ensure backward compatibility.
>>>
>>>
>>>> FYI- I tried running with 0 reducers
>>>
>>> I assume you've also adapted FetcherOutputFormat ?
>>>
>>> Btw., you could think about inlining the "avroConversion" step (or parts
>>> of it) into FetcherOutputFormat, which could also remove the need to
>>> store the content.
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
>>>> Hi Sebastian,
>>>>
>>>> Thanks for your reply. Couple of questions -
>>>>
>>>> 1. We have customized the Apache Nutch jobs a bit. We have a separate
>>>> parse job (ParseSegment.java) after the fetch job (Fetcher.java). So, as
>>>> suggested above, if I use fetcher.store.content=false, I am assuming the
>>>> "content" folder will not be created and hence our parse job won't work,
>>>> because it takes the content folder as an input. Also, we have added an
>>>> additional step, "avroConversion", which takes "parse_data",
>>>> "parse_text", "content" and "crawl_fetch" as input and converts them
>>>> into a specific Avro schema defined by us. So I think I will end up
>>>> breaking a lot of things if I add fetcher.store.content=false and do
>>>> parsing in the fetch phase only (fetcher.parse=true).
>>>>
>>>> 2. In your earlier email, you said "a lot of code accessing the
>>>> segments still assumes map files" - which code are you referring to? In
>>>> my use case above, we are not sending the crawled output to any
>>>> indexers. In the Avro conversion step, we just convert the data into our
>>>> Avro schema and dump it to HDFS. Do you think we still need reducers in
>>>> the fetch phase? FYI - I tried running with 0 reducers and don't see any
>>>> impact as such.
>>>>
>>>> Appreciate your help.
>>>>
>>>> Regards
>>>> Prateek
>>>>
>>>> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel <[email protected]> wrote:
>>>>
>>>>     Hi Prateek,
>>>>
>>>>     you're right, there is no specific reducer used, but without a
>>>>     reduce step the segment data isn't (re)partitioned and isn't sorted.
>>>>     This was a strong requirement when Nutch was a complete search
>>>>     engine and the "content" subdir of a segment was used as a page
>>>>     cache. Getting the content from a segment is fast if the segment is
>>>>     partitioned in a predictable way (hash partitioning) and map files
>>>>     are used.
>>>>
>>>>     Well, this isn't a strong requirement anymore, since Nutch uses
>>>>     Solr, Elasticsearch or other index services. But a lot of code
>>>>     accessing the segments still assumes map files. Removing the reduce
>>>>     step from the fetcher would also mean a lot of work in the code and
>>>>     tools accessing the segments, esp. to ensure backward compatibility.
>>>>
>>>>     Have you tried to run the fetcher with
>>>>      fetcher.parse=true
>>>>      fetcher.store.content=false ?
>>>>     This will save a lot of time: without the need to write the large
>>>>     raw content, the reduce phase should be fast - only a small fraction
>>>>     (5-10%) of the fetcher map phase.
>>>>
>>>>     Best,
>>>>     Sebastian
>>>>
>>>>
>>>>     On 7/20/20 11:38 PM, prateek sachdeva wrote:
>>>>     > Hi Guys,
>>>>     >
>>>>     > As per the Apache Nutch 1.16 Fetcher class implementation here -
>>>>     > https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
>>>>     > this is a map-only job. I don't see any reducer set in the job. So
>>>>     > my question is: why not set job.setNumReduceTasks(0) and save time
>>>>     > by outputting directly to HDFS?
>>>>     >
>>>>     > Regards
>>>>     > Prateek
>>>>     >
>>>>
>>>
>>>
> 
