Can anyone please answer these questions?
On Wed, Mar 5, 2014 at 7:52 PM, S.L <[email protected]> wrote:

> Marcus,
>
> Thanks! Please bear with me, I have a few more questions.
>
> 1. I am confused by your mention that the number of cycles is determined
>    by the topN parameter. From reading the documentation I was under the
>    impression that topN determines the maximum number of links to be
>    included at each level, so to avoid missing any link I gave it an
>    arbitrary value of 30,000 at each level. Is my understanding correct?
>    If so, how can I reduce the delay that topN is causing?
> 2. Apparently I am doing the same thing as you, i.e. indexing in
>    Fetcher.java by calling a utility method that takes the content and
>    populates the index (Solr), rather than using the indexing feature
>    provided by Nutch. My understanding is that this is the map phase of
>    the Nutch job, and that the reduce phase is only relevant for
>    computing the outlinks of each URL, which I don't need because for me
>    every URL/link is equally important. The question is that I still see
>    Hadoop spending a significant amount of time in the reduce phase. If
>    my understanding is correct, can I disable the reduce phase for the
>    Nutch job, and how can I do so?
> 3. When a page is crawled, certain criteria are applied to determine
>    whether it is eligible for indexing. In my sample crawl of four
>    websites I end up with fewer documents in my index than expected. How
>    do you suggest I track how many pages were crawled before the
>    filtering criteria were applied in my utility method, so that I know
>    how restrictive my filtering criteria are (e.g. whether only 10% were
>    indexed)?
>
> I would really appreciate it if you could take the time to answer my
> questions or provide me with any leads.
>
> Thanks in advance!
>
> On Tue, Mar 4, 2014 at 9:58 AM, Markus Jelsma <[email protected]> wrote:
>
>> Yes, the console shows you what it is doing, stdout as well. In your
>> case it is the depth that makes it take so long: it does 30,000 crawl
>> cycles. We do cycles of around 1000-2000 and that takes between 10 and
>> 15 minutes, and we skip the indexing job (we index in the Fetcher). In
>> the end we do around 90-110 cycles every day, so 30,000 would take us
>> almost a year! :)
>>
>> If your crawler does not finish all its records before the default or
>> adaptive interval, it won't stop for a long time! :)
>>
>> -----Original message-----
>> > From: S.L <[email protected]>
>> > Sent: Tuesday 4th March 2014 8:09
>> > To: [email protected]
>> > Subject: When can the Nutch MapReduce job be considered complete?
>> >
>> > Hi All,
>> >
>> > I have set up a pseudo-distributed cluster using Hadoop 2.3, am
>> > running Nutch 1.7 on it as a MapReduce job, and I use the following
>> > command to submit the job:
>> >
>> > /mnt/hadoop-2.3.0/bin/hadoop jar
>> > /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job
>> > org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000
>> > -topN 30000
>> >
>> > I notice that the crawl is still continuing after 72 hours. I am only
>> > crawling 4 websites and have disabled outlinks to external domains.
>> > Most of the pages are crawled in the first few hours, but then the
>> > crawl keeps running and very few pages are crawled in those extended
>> > crawl sessions. Is my high topN value causing this seemingly
>> > never-ending crawl?
>> >
>> > How can I track the status (from the Hadoop console or otherwise)?
>> >
>> > Thanks.
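A note on question 1 and the cycles Markus mentions: with the one-shot
org.apache.nutch.crawl.Crawl command, -depth sets the number of
generate/fetch/parse/updatedb rounds while -topN only caps how many URLs are
fetched per round, so -depth 1000 allows up to 1000 rounds even when there is
little new left to fetch. One way to keep the delay under control is to drive
the rounds yourself with the individual Nutch 1.x jobs and stop after a fixed
number of small cycles. A rough sketch, not a tested script; the cycle count
of 20, the topN of 2000, and the reuse of the paths from the command above
are arbitrary assumptions:

    NUTCH_JOB=/opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job
    HADOOP=/mnt/hadoop-2.3.0/bin/hadoop
    CRAWLDB=crawldirectory/crawldb
    SEGMENTS=crawldirectory/segments

    # Inject the seed URLs into the CrawlDb once
    $HADOOP jar $NUTCH_JOB org.apache.nutch.crawl.Injector $CRAWLDB /urls

    # 20 small cycles instead of -depth 1000
    for i in $(seq 1 20); do
      # Generate a fetch list of at most 2000 URLs for this cycle
      $HADOOP jar $NUTCH_JOB org.apache.nutch.crawl.Generator $CRAWLDB $SEGMENTS -topN 2000

      # Pick the newest segment; segments live on HDFS and are named by timestamp
      SEGMENT=$($HADOOP fs -ls $SEGMENTS | tail -1 | awk '{print $NF}')

      $HADOOP jar $NUTCH_JOB org.apache.nutch.fetcher.Fetcher $SEGMENT
      $HADOOP jar $NUTCH_JOB org.apache.nutch.parse.ParseSegment $SEGMENT
      $HADOOP jar $NUTCH_JOB org.apache.nutch.crawl.CrawlDb $CRAWLDB $SEGMENT
    done

A production loop would also break out as soon as the Generator reports that
there are no more URLs to fetch, which is effectively the point at which the
crawl can be considered complete.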

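On question 3 (how restrictive the filtering is): the CrawlDb records a
status for every URL it has seen, so comparing its fetched count with the
number of documents in Solr gives the indexed-to-crawled ratio without
changing the utility method. A sketch, assuming the same crawl directory as
above and a Solr core named collection1 on localhost (adjust both to your
setup):

    # Print CrawlDb statistics: total URLs plus counts per status
    # (db_fetched, db_unfetched, db_gone, ...)
    /mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
        org.apache.nutch.crawl.CrawlDbReader crawldirectory/crawldb -stats

    # Count the documents that actually reached the index; numFound in the
    # response is the number of indexed documents
    curl 'http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json'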

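On tracking the status from the Hadoop console: with Hadoop 2.x the YARN
ResourceManager web UI lists running and finished MapReduce applications, and
the same information is available on the command line. The port below is the
default (8088) and will differ if yarn.resourcemanager.webapp.address has
been changed:

    # ResourceManager web UI (default): http://<resourcemanager-host>:8088/cluster

    # Command-line equivalents
    /mnt/hadoop-2.3.0/bin/yarn application -list
    /mnt/hadoop-2.3.0/bin/mapred job -list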