Markus,

Thanks! Please bear with me, I have a few more questions.


   1. I am confused by your mention that the number of cycles is determined
   by the topN parameter. From reading the documentation I was under the
   impression that topN determines the maximum number of links that will be
   included at each level, so to avoid missing any links I gave it an
   arbitrary value of 30,000 at each level. Is my understanding correct? If
   so, how can I reduce the delay that topN is causing?
   2. Apparently I am doing the same thing as you, i.e. indexing in
   Fetcher.java by calling a utility method that takes the content and
   populates the index (Solr), rather than using the indexing feature
   provided by Nutch (a rough sketch of my utility method is included after
   this list). My understanding is that this happens in the map phase of the
   Nutch job, and that the reduce phase is only relevant for computing the
   outlinks of each URL, which I don't need because for me every URL/link is
   equally important. The question is that I still see Hadoop spending a
   significant amount of time in the reduce phase. If my understanding is
   correct, can I disable the reduce phase for the Nutch job, and how would
   I do so?
   3. When a page is crawled, certain criteria are applied to determine
   whether it is eligible for indexing. In my sample crawl of four websites
   I end up with fewer documents in my index than expected. How do you
   suggest I implement a way, in my utility method, to see how many pages
   were crawled before the filtering criteria were applied, so that I know
   how restrictive my filtering criteria are (e.g. whether only 10% of
   crawled pages were indexed)? I have sketched one counter-based idea after
   this list.
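
For context on point 2, this is roughly what my utility method looks like.
It is a simplified sketch, not the actual code; the class name, field names
and Solr URL are only illustrative, and I am assuming SolrJ 4.x's
HttpSolrServer:

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

/**
 * Simplified sketch of the helper I call from Fetcher.java once a page has
 * been fetched and parsed.
 */
public class SolrIndexUtil {

    private final HttpSolrServer solr;

    public SolrIndexUtil(String solrUrl) {
        // e.g. "http://localhost:8983/solr/collection1"
        this.solr = new HttpSolrServer(solrUrl);
    }

    /** Index a single fetched page if it passes my filtering criteria. */
    public void index(String url, String title, String text)
            throws SolrServerException, IOException {
        if (!passesFilter(url, text)) {
            return; // page was crawled but is not indexed
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", url);
        doc.addField("title", title);
        doc.addField("content", text);
        solr.add(doc);
        solr.commit(); // commits are batched in reality; shown for brevity
    }

    private boolean passesFilter(String url, String text) {
        // my eligibility criteria go here (simplified placeholder)
        return text != null && text.length() > 200;
    }
}

Everything that ends up in Solr goes through this one method from the map
side, which is why I assumed the reduce phase was not doing anything I need.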
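
For point 3, the only idea I have so far is to bump a pair of Hadoop
counters around the filtering check, roughly like this (assuming I can pass
the old-API Reporter from the Fetcher's map task down into my utility; the
group and counter names are made up):

import org.apache.hadoop.mapred.Reporter;

/**
 * Sketch: count every page that reaches my utility method versus every page
 * that actually gets indexed, so I can see how restrictive the filter is.
 */
public class IndexingStats {

    private static final String GROUP = "MyIndexing"; // made-up counter group

    /** Called for every page handed to the indexing utility. */
    public static void recordCrawled(Reporter reporter) {
        reporter.incrCounter(GROUP, "pages_crawled", 1);
    }

    /** Called only for pages that pass the filter and are sent to Solr. */
    public static void recordIndexed(Reporter reporter) {
        reporter.incrCounter(GROUP, "pages_indexed", 1);
    }
}

The ratio of pages_indexed to pages_crawled in the job counters would then
tell me whether, say, only 10% of the crawled pages survived the filter.
Does that sound like a reasonable approach?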


I would really appreciate it if you could take the time to answer my
questions or provide me with any leads.


Thanks in advance!


On Tue, Mar 4, 2014 at 9:58 AM, Markus Jelsma <[email protected]> wrote:

> Yes, the console shows you what it is doing, stdout as well.
> In your case it is the depth that makes it take so long; it does 30.000
> crawl cycles. We do cycles of around 1000-2000 and that takes between 10
> and 15 minutes and we skip the indexing job (we index in the Fetcher). In
> the end we do around 90-110 cycles every day so 30.000 would take us almost
> a year! :)
>
> If your crawler does not finish all its records before the default or
> adaptive interval, it won't stop for a long time! :)
>
> -----Original message-----
> > From:S.L <[email protected]>
> > Sent: Tuesday 4th March 2014 8:09
> > To: [email protected]
> > Subject: When can the Nutch MapReduce job be considered complete?
> >
> > Hi All,
> >
> > I have set up a pseudo-distributed cluster using Hadoop 2.3 and am
> > running Nutch 1.7 on it as a MapReduce job, and I use the following
> > command to submit the job.
> >
> > /mnt/hadoop-2.3.0/bin/hadoop jar
> > /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job
> > org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000 -topN
> > 30000
> >
> > I notice that the crawl is continuing even after 72 hours. I am only
> > crawling 4 websites and have disabled outlinks to external domains. Most
> > of the pages are crawled in the first few hours, but then the crawl keeps
> > on running and very few pages are crawled in those extended crawl
> > sessions. Is my high topN value causing this seemingly never-ending
> > crawl?
> >
> > How can I track the status (from the Hadoop console or otherwise)?
> >
> > Thanks.
> >
>
