Can anyone please answer these questions?
On Wed, Mar 5, 2014 at 7:52 PM, S.L <[email protected]> wrote:

> Marcus,
>
> Thanks! Please bear with me, I have a few more questions.
>
> 1. I am confused by your mention that the number of cycles is determined
>    by the topN parameter. From reading the documentation I was under the
>    impression that topN determines the maximum number of links to be
>    included at each level, so to avoid missing any link I gave it an
>    arbitrary value of 30,000 at each level. Is my understanding correct?
>    If so, how can I reduce the delay that topN is causing?
> 2. Apparently I am doing the same thing as you, i.e. indexing in
>    Fetcher.java by calling a utility method that takes the content and
>    populates the index (Solr), rather than using the indexing feature
>    provided by Nutch. My understanding is that this is the map phase of
>    the Nutch job, and that the reduce phase is only relevant for
>    computing the outlinks of each URL, which I don't need because for me
>    every URL/link is equally important. The question is that I still see
>    Hadoop spending a significant amount of time in the reduce phase. If
>    my understanding is correct, can I disable the reduce phase for the
>    Nutch job, and how can I do so?
> 3. When a page is crawled, certain criteria are applied to determine
>    whether it is eligible for indexing. In my sample crawl of four
>    websites I end up with fewer documents in my index than expected. How
>    do you suggest I track how many pages were crawled before the
>    filtering criteria were applied in my utility method, so that I know
>    how restrictive my filtering criteria are (e.g. whether only 10% were
>    indexed)?
>
> I would really appreciate it if you could take the time to answer my
> questions or provide me with any leads.
>
> Thanks in advance!
>
> On Tue, Mar 4, 2014 at 9:58 AM, Markus Jelsma <[email protected]> wrote:
>
>> Yes, the console shows you what it is doing, stdout as well. In your
>> case it is the depth that makes it take so long: it does 30,000 crawl
>> cycles. We do cycles of around 1000-2000 and that takes between 10 and
>> 15 minutes, and we skip the indexing job (we index in the Fetcher). In
>> the end we do around 90-110 cycles every day, so 30,000 would take us
>> almost a year! :)
>>
>> If your crawler does not finish all its records before the default or
>> adaptive interval, it won't stop for a long time! :)
>>
>> -----Original message-----
>> > From: S.L <[email protected]>
>> > Sent: Tuesday 4th March 2014 8:09
>> > To: [email protected]
>> > Subject: When can the Nutch MapReduce job be considered complete?
>> >
>> > Hi All,
>> >
>> > I have set up a pseudo-distributed cluster using Hadoop 2.3, am
>> > running Nutch 1.7 on it as a MapReduce job, and I use the following
>> > command to submit the job:
>> >
>> > /mnt/hadoop-2.3.0/bin/hadoop jar
>> > /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job
>> > org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000
>> > -topN 30000
>> >
>> > I notice that the crawl is still continuing after 72 hours. I am only
>> > crawling 4 websites and have disabled outlinks to external domains.
>> > Most of the pages are crawled in the first few hours, but then the
>> > crawl keeps running and very few pages are crawled in those extended
>> > crawl sessions. Is my high topN value causing this seemingly
>> > never-ending crawl?
>> >
>> > How can I track the status (from the Hadoop console or otherwise)?
>> >
>> > Thanks.
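A note on question 1 and the cycles Markus mentions: with the one-shot
org.apache.nutch.crawl.Crawl command, -depth sets the number of
generate/fetch/parse/updatedb rounds while -topN only caps how many URLs are
fetched per round, so -depth 1000 allows up to 1000 rounds even when there is
little new left to fetch. One way to keep the delay under control is to drive
the rounds yourself with the individual Nutch 1.x jobs and stop after a fixed
number of small cycles. A rough sketch, not a tested script; the cycle count
of 20, the topN of 2000, and the reuse of the paths from the command above
are arbitrary assumptions:

    NUTCH_JOB=/opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job
    HADOOP=/mnt/hadoop-2.3.0/bin/hadoop
    CRAWLDB=crawldirectory/crawldb
    SEGMENTS=crawldirectory/segments

    # Inject the seed URLs into the CrawlDb once
    $HADOOP jar $NUTCH_JOB org.apache.nutch.crawl.Injector $CRAWLDB /urls

    # 20 small cycles instead of -depth 1000
    for i in $(seq 1 20); do
      # Generate a fetch list of at most 2000 URLs for this cycle
      $HADOOP jar $NUTCH_JOB org.apache.nutch.crawl.Generator $CRAWLDB $SEGMENTS -topN 2000

      # Pick the newest segment; segments live on HDFS and are named by timestamp
      SEGMENT=$($HADOOP fs -ls $SEGMENTS | tail -1 | awk '{print $NF}')

      $HADOOP jar $NUTCH_JOB org.apache.nutch.fetcher.Fetcher $SEGMENT
      $HADOOP jar $NUTCH_JOB org.apache.nutch.parse.ParseSegment $SEGMENT
      $HADOOP jar $NUTCH_JOB org.apache.nutch.crawl.CrawlDb $CRAWLDB $SEGMENT
    done

A production loop would also break out as soon as the Generator reports that
there are no more URLs to fetch, which is effectively the point at which the
crawl can be considered complete.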

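On question 3 (how restrictive the filtering is): the CrawlDb records a
status for every URL it has seen, so comparing its fetched count with the
number of documents in Solr gives the indexed-to-crawled ratio without
changing the utility method. A sketch, assuming the same crawl directory as
above and a Solr core named collection1 on localhost (adjust both to your
setup):

    # Print CrawlDb statistics: total URLs plus counts per status
    # (db_fetched, db_unfetched, db_gone, ...)
    /mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
        org.apache.nutch.crawl.CrawlDbReader crawldirectory/crawldb -stats

    # Count the documents that actually reached the index; numFound in the
    # response is the number of indexed documents
    curl 'http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json'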

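On tracking the status from the Hadoop console: with Hadoop 2.x the YARN
ResourceManager web UI lists running and finished MapReduce applications, and
the same information is available on the command line. The port below is the
default (8088) and will differ if yarn.resourcemanager.webapp.address has
been changed:

    # ResourceManager web UI (default): http://<resourcemanager-host>:8088/cluster

    # Command-line equivalents
    /mnt/hadoop-2.3.0/bin/yarn application -list
    /mnt/hadoop-2.3.0/bin/mapred job -list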