Markus, thanks! Please bear with me, I have a few more questions.
1. I am confused when you mention that the number of cycles is determined by the topN parameter. I was under the impression from reading the documentation that topN determines the maximum number of links that will be included at each level, so, in order not to miss any links, I gave it an arbitrary value of 30,000 at each level. Is my understanding correct? If so, how can I reduce the delay that topN is causing?

2. Apparently I am doing exactly the same thing as you, i.e. indexing in Fetcher.java by calling a utility method that takes the content and populates the index (Solr), rather than using the indexing feature provided by Nutch (a rough sketch of that utility is at the bottom of this mail). My understanding is that this is the map phase of the Nutch job, and that the reduce phase is only relevant for computing the outlinks of each URL, which I don't need because for me every URL/link is equally important. The question is that I still see Hadoop spending a significant amount of time in the reduce phase. If my understanding is correct, can I disable the reduce phase for the Nutch job, and if so, how?

3. When a page is crawled, certain criteria are applied to determine whether it is eligible for indexing. In my sample crawl of four websites I end up with fewer documents in my index than expected. How do you suggest I implement a way, in my utility method, to see how many pages were crawled before the filtering criteria were applied, so that I know how restrictive my filtering is (e.g. only 10% of the crawled pages were indexed)? I have put a rough counter-based idea at the bottom of this mail; would that be a reasonable approach?

I would really appreciate it if you could take the time to answer my questions or provide me with any leads. Thanks in advance!

On Tue, Mar 4, 2014 at 9:58 AM, Markus Jelsma <[email protected]> wrote:

> Yes, the console shows you what it is doing, stdout as well.
> In your case it is the depth that makes it take so long; it does 30.000
> crawl cycles. We do cycles of around 1000-2000 and that takes between 10
> and 15 minutes, and we skip the indexing job (we index in the Fetcher). In
> the end we do around 90-110 cycles every day, so 30.000 would take us
> almost a year! :)
>
> If your crawler does not finish all its records before the default or
> adaptive interval, it won't stop for a long time! :)
>
> -----Original message-----
> > From: S.L <[email protected]>
> > Sent: Tuesday 4th March 2014 8:09
> > To: [email protected]
> > Subject: When can the Nutch MapReduce job be considered complete?
> >
> > Hi All,
> >
> > I have set up a pseudo-distributed cluster using Hadoop 2.3 and am
> > running Nutch 1.7 on it as a MapReduce job, and I use the following
> > command to submit the job:
> >
> > /mnt/hadoop-2.3.0/bin/hadoop jar
> > /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job
> > org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000 -topN
> > 30000
> >
> > I notice that the crawl is still running even after 72 hours. I am only
> > crawling 4 websites and have disabled outlinks to external domains. Most
> > of the pages are crawled in the first few hours, but then the crawl keeps
> > on running and very few pages are crawled in those extended crawl
> > sessions. Is my high topN value causing this seemingly never-ending
> > crawl?
> >
> > How can I track the status (from the Hadoop console or otherwise)?
> >
> > Thanks.
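P.S. Re question 2: for reference, this is roughly what the utility method I call from Fetcher.java does, heavily simplified. The Solr URL and field names below are placeholders, not my real configuration, and commits are issued separately (not shown here).

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;

    public class SolrIndexUtil {

        // HttpSolrServer is thread safe, so all fetcher threads share one instance.
        private static final HttpSolrServer SOLR =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Called from the fetcher once a page has been fetched and parsed.
        public static void index(String url, String title, String text)
                throws SolrServerException, IOException {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", url);
            doc.addField("url", url);
            doc.addField("title", title);
            doc.addField("content", text);
            SOLR.add(doc);
        }
    }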
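P.P.S. Re question 3: would something along these lines be a reasonable way to measure how restrictive my filtering is? The idea is to bump two Hadoop counters from the same utility, one for every fetched page before the eligibility check and one only for pages that are actually sent to Solr, assuming I can pass the Reporter from Fetcher.java down into the utility (the group and counter names are arbitrary).

    import org.apache.hadoop.mapred.Reporter;

    public class IndexStats {

        private static final String GROUP = "SolrIndexUtil";

        // Call for every fetched page, before any eligibility check.
        public static void pageSeen(Reporter reporter) {
            reporter.incrCounter(GROUP, "pages-seen", 1);
        }

        // Call only for pages that pass the filter and are sent to Solr.
        public static void pageIndexed(Reporter reporter) {
            reporter.incrCounter(GROUP, "pages-indexed", 1);
        }
    }

Both counters would then show up in the job's counter summary in the Hadoop console, so after a cycle I could compare pages-seen with pages-indexed and see whether, say, only 10% of the crawled pages made it into the index.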

