<< 1. I am confused when you mention that the number of cycles is determined by the topN parameter. I was under the impression from reading the documentation that topN determines the maximum number of links that will be included at each level, so to avoid missing any links I gave it an arbitrary value of 30,000 at each level. Is my understanding correct? If so, how can I reduce the delay that topN is causing? >> Yes, you are right: the 30,000 is the number of top URLs to be selected into each crawl segment, i.e. per generate/fetch cycle.
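For illustration only (this reuses the command quoted at the bottom of this thread; the reduced values are arbitrary examples, not recommendations): -depth controls how many generate/fetch cycles are run and -topN caps how many top-scoring URLs each cycle selects, so lowering either one shortens the overall job:

/mnt/hadoop-2.3.0/bin/hadoop jar \
    /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
    org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 50 -topN 2000

Since you crawl only four sites with external outlinks disabled, a topN far below 30,000 may still cover every link in each cycle; how low you can go depends on how many pages those sites actually have.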
<< 2. Apparently I am doing the exact same thing as you, i.e. indexing in Fetcher.java by calling a utility method that takes the content and populates the index (Solr), rather than using the indexing feature provided by Nutch. My understanding is that this is the Map phase of the Nutch job, and that the Reduce phase is only relevant for computing the outlinks for each URL, which I don't need because for me every URL/link is equally important. The question is that I still see Hadoop spending a significant amount of time during the reduce job. If my understanding is correct, can I disable the reduce phase for the Nutch job, and how can I do so? >> The Fetcher class is a MapRunnable implementation, so it does not have a Reduce phase. Which Nutch Reduce phase are you asking about? In the Parse phase, Nutch extracts all URLs from the content and calculates a score for each URL.

<< 3. When a page is crawled, certain criteria are applied to determine whether it is eligible for indexing. In my sample crawl of four websites I end up with fewer documents in my index than expected. How do you suggest I check how many pages were crawled before the filtering criteria were applied in my utility method, so I can see how restrictive my filtering criteria are (e.g. whether only 10% were indexed)? >> You can use the bin/nutch readdb command to print overall statistics for the crawl database. The Indexer MapReduce job also prints some indexing status to the Hadoop console, such as the number of URLs skipped by filters, the number of documents added, indexing errors, etc.
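As a concrete example of the readdb suggestion (the crawldb path below is assumed from the -dir crawldirectory value in the command quoted further down; adjust it to your actual crawl directory), the -stats option prints the total number of URLs in the crawl database plus a breakdown by status (fetched, unfetched, gone, etc.), which you can compare against the document count in your Solr index:

bin/nutch readdb crawldirectory/crawldb -stats

If I remember correctly, readdb maps to the org.apache.nutch.crawl.CrawlDbReader class, so on your cluster you can also run it through hadoop jar with the .job file, the same way you run the Crawl class.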
On Tue, Mar 11, 2014 at 3:56 AM, S.L <[email protected]> wrote:

> Can anyone please answer these questions ?
>
> On Wed, Mar 5, 2014 at 7:52 PM, S.L <[email protected]> wrote:
>
> > Marcus,
> >
> > Thanks! Please bear with me I have a few more questions..
> >
> > [...]
> >
> > I would really appreciate if you could take the time to answer my
> > questions or provide me any leads.
> >
> > Thanks in advance!
> >
> > On Tue, Mar 4, 2014 at 9:58 AM, Markus Jelsma <[email protected]> wrote:
> >
> >> Yes, the console shows you what it is doing, stdout as well.
> >> In your case it is the depth that makes it take so long, it does 30.000
> >> crawl cycles. We do cycles of around 1000-2000 and that takes between 10
> >> and 15 minutes and we skip the indexing job (we index in the Fetcher). In
> >> the end we do around 90-110 cycles every day, so 30.000 would take us
> >> almost a year! :)
> >>
> >> If your crawler does not finish all its records before the default or
> >> adaptive interval, it won't stop for a long time! :)
> >>
> >> -----Original message-----
> >> > From: S.L <[email protected]>
> >> > Sent: Tuesday 4th March 2014 8:09
> >> > To: [email protected]
> >> > Subject: When can the Nutch MapReduce job be considered complete?
> >> >
> >> > Hi All,
> >> >
> >> > I have set up a pseudo-distributed cluster using Hadoop 2.3, am running
> >> > Nutch 1.7 on it as a MapReduce job, and I use the following command to
> >> > submit the job:
> >> >
> >> > /mnt/hadoop-2.3.0/bin/hadoop jar
> >> > /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job
> >> > org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000 -topN
> >> > 30000
> >> >
> >> > I notice that the crawl is continuing even after 72 hours; I am only
> >> > crawling 4 websites and have disabled outlinks to external domains. Most
> >> > of the pages are crawled in the first few hours, but the crawl keeps
> >> > running and very few pages are crawled in those extended crawl
> >> > sessions. Is my high topN value causing this seemingly never-ending
> >> > crawl?
> >> >
> >> > How can I track the status (from the Hadoop console or otherwise)?
> >> >
> >> > Thanks.

--
Don't Grow Old, Grow Up... :-)

