Thanks, feng. I will use the readdb command.
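
For my own notes, a rough sketch of what I plan to run (assuming the crawldb
ended up under the crawldirectory I passed to the Crawl job; paths may need
adjusting):

bin/nutch readdb crawldirectory/crawldb -stats

That should print the overall counts from the crawl database (total URLs,
db_fetched, db_unfetched, and so on).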

On Mon, Mar 10, 2014 at 10:13 PM, feng lu <[email protected]> wrote:

> <<
> 1. I am confused when you mention that the number of cycles is determined
>    by the topN parameter. I was under the impression from reading the
>    documentation that topN determines the max number of links that are
>    going to be included at each level, so to not miss any link I gave an
>    arbitrary value of 30,000 at each level. Is my understanding correct?
>    If so, how can I reduce the delay that topN is causing?
> >>
> Yes, you are right: the 30,000 is the number of top-scoring URLs to be
> selected for each crawl segment.
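>
> For example (just a sketch based on the command you posted; the -topN 2000
> here is only an illustrative value, keep your other options as they are),
> lowering -topN limits how many of the top-scoring URLs go into each
> segment:
>
> /mnt/hadoop-2.3.0/bin/hadoop jar \
>     /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
>     org.apache.nutch.crawl.Crawl /urls -dir crawldirectory \
>     -depth 1000 -topN 2000
>
> Each cycle then fetches a much smaller segment, so the individual
> fetch/parse rounds finish faster.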
>
> <<
>    2. Apparently I am doing the exact same thing as you, i.e. indexing in
>    Fetcher.java by calling a utility method that takes the content and
>    populates the index (Solr), and not using the indexing feature provided
>    by Nutch. My understanding is that this is the map phase of the Nutch
>    job and that the reduce phase for Nutch is only relevant to compute the
>    outlinks for each URL, which I don't need because for me every URL/link
>    is equally important. The question is that I still see Hadoop spending
>    a significant amount of time during the reduce job. If my understanding
>    is correct, can I disable the reduce phase for the Nutch job, and how
>    can I do so?
> >>
> The Fetcher class is a MapRunnable implementation, so it has no reduce
> phase of its own. Which reduce phase of Nutch are you asking about? In the
> parse phase Nutch extracts all URLs from the content and calculates a
> score for each URL.
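>
> Roughly, a helper like the one you describe might look like this (only a
> sketch using SolrJ; the class name, Solr URL and field names here are
> invented, not taken from your code or from Nutch):
>
> import java.io.IOException;
> import org.apache.solr.client.solrj.SolrServerException;
> import org.apache.solr.client.solrj.impl.HttpSolrServer;
> import org.apache.solr.common.SolrInputDocument;
>
> // Hypothetical utility called from the fetcher's map side; not Nutch API.
> public class SolrIndexUtil {
>   private final HttpSolrServer server =
>       new HttpSolrServer("http://localhost:8983/solr/collection1");
>
>   public void index(String url, String content)
>       throws SolrServerException, IOException {
>     SolrInputDocument doc = new SolrInputDocument();
>     doc.addField("id", url);          // URL as the unique key
>     doc.addField("content", content); // raw page content
>     server.add(doc);                  // commits can be batched separately
>   }
> }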
>
> <<
>    3. When a page is crawled, certain criteria are applied to determine
>    whether it is eligible for indexing. In my sample crawl of four
>    websites I end up with fewer documents in my index than expected. How
>    do you suggest I implement a way, in my utility method, to see how many
>    pages were crawled before the filtering criteria were applied, so that
>    I know how restrictive my filtering criteria are (like only 10% were
>    indexed or something)?
> >>
> You can use the bin/nutch readdb command to print overall statistics for
> the crawl database. Also, the Indexer MapReduce phase prints some indexing
> status to the Hadoop console, such as the number of URLs skipped by
> filters, the number of documents added, and the URLs that failed to index.
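>
> If you want the count from inside your own utility method, one rough idea
> (assuming you can pass the Hadoop Reporter from the fetcher into it; the
> group, counter and method names below are made up) is to bump a pair of
> counters around your filter:
>
> import org.apache.hadoop.mapred.Reporter;
>
> // Sketch only, using the old mapred API that Nutch 1.x is built on.
> public class IndexCounter {
>   void maybeIndex(String url, String content, Reporter reporter) {
>     reporter.incrCounter("MyIndexing", "pages-seen", 1);
>     if (passesMyFilter(content)) {
>       reporter.incrCounter("MyIndexing", "pages-indexed", 1);
>       // ... hand the page off to your Solr utility here ...
>     }
>   }
>
>   // Placeholder criteria; replace with your real filter.
>   private boolean passesMyFilter(String content) {
>     return content != null && content.length() > 500;
>   }
> }
>
> Both counters then show up with the other job counters in the Hadoop
> console, so you can read off the indexed/seen ratio directly.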
>
>
>
>
> On Tue, Mar 11, 2014 at 3:56 AM, S.L <[email protected]> wrote:
>
> > Can anyone please answer these questions?
> >
> >
> > On Wed, Mar 5, 2014 at 7:52 PM, S.L <[email protected]> wrote:
> >
> > > Marcus,
> > >
> > > Thanks! Please bear with me, I have a few more questions...
> > >
> > >
> > >    1. I am confused when you mention that the number of cycles is
> > >    determined by the topN parameter. I was under the impression from
> > >    reading the documentation that topN determines the max number of
> > >    links that are going to be included at each level, so to not miss
> > >    any link I gave an arbitrary value of 30,000 at each level. Is my
> > >    understanding correct? If so, how can I reduce the delay that topN
> > >    is causing?
> > >    2. Apparently I am doing the exact same thing as you, i.e. indexing
> > >    in Fetcher.java by calling a utility method that takes the content
> > >    and populates the index (Solr), and not using the indexing feature
> > >    provided by Nutch. My understanding is that this is the map phase
> > >    of the Nutch job and that the reduce phase for Nutch is only
> > >    relevant to compute the outlinks for each URL, which I don't need
> > >    because for me every URL/link is equally important. The question is
> > >    that I still see Hadoop spending a significant amount of time
> > >    during the reduce job. If my understanding is correct, can I
> > >    disable the reduce phase for the Nutch job, and how can I do so?
> > >    3. When a page is crawled, certain criteria are applied to
> > >    determine whether it is eligible for indexing. In my sample crawl
> > >    of four websites I end up with fewer documents in my index than
> > >    expected. How do you suggest I implement a way, in my utility
> > >    method, to see how many pages were crawled before the filtering
> > >    criteria were applied, so that I know how restrictive my filtering
> > >    criteria are (like only 10% were indexed or something)?
> > >
> > >
> > > I would really appreciate it if you could take the time to answer my
> > > questions or provide any leads.
> > >
> > >
> > > Thanks in advance!
> > >
> > >
> > > On Tue, Mar 4, 2014 at 9:58 AM, Markus Jelsma
> > > <[email protected]> wrote:
> > >
> > >> Yes, the console shows you what it is doing, stdout as well. In your
> > >> case it is the depth that makes it take so long: it does 30,000
> > >> crawl cycles. We do cycles of around 1000-2000 and that takes
> > >> between 10 and 15 minutes, and we skip the indexing job (we index in
> > >> the Fetcher). In the end we do around 90-110 cycles every day, so
> > >> 30,000 would take us almost a year! :)
> > >>
> > >> If your crawler does not finish all its records before the default
> > >> or adaptive interval, it won't stop for a long time! :)
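> > >>
> > >> For reference, the default interval comes from the
> > >> db.fetch.interval.default property (2592000 seconds, i.e. 30 days,
> > >> in the shipped defaults). It can be overridden in nutch-site.xml,
> > >> roughly like this (the value here is just an example):
> > >>
> > >> <property>
> > >>   <name>db.fetch.interval.default</name>
> > >>   <value>259200</value> <!-- 3 days, example only -->
> > >> </property>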
> > >>
> > >> -----Original message-----
> > >> > From:S.L <[email protected]>
> > >> > Sent: Tuesday 4th March 2014 8:09
> > >> > To: [email protected]
> > >> > Subject: When can the Nutch MapReduce job be considered complete?
> > >> >
> > >> > Hi All,
> > >> >
> > >> > I have set up a pseudo-distributed cluster using Hadoop 2.3 and I
> > >> > am running Nutch 1.7 on it as a MapReduce job. I use the following
> > >> > command to submit the job:
> > >> >
> > >> > /mnt/hadoop-2.3.0/bin/hadoop jar
> > >> > /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job
> > >> > org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000
> > >> > -topN 30000
> > >> >
> > >> > I notice that the crawl is still running even after 72 hours. I am
> > >> > only crawling 4 websites and have disabled outlinks to external
> > >> > domains. Most of the pages are crawled in the first few hours, but
> > >> > then the crawl keeps running and very few pages are crawled in
> > >> > those extended crawl sessions. Is my high topN value causing this
> > >> > seemingly never-ending crawl?
> > >> >
> > >> > How can I track the status (from the Hadoop console or otherwise)?
> > >> >
> > >> > Thanks.
> > >> >
> > >>
> > >
> > >
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>
