Thanks Feng, I will use the readdb command.
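Assuming the crawl db lands under the -dir I pass to the Crawl job (so
crawldirectory/crawldb in my setup; that exact path is my guess, not
something confirmed in this thread), I expect the call to look roughly
like:

  bin/nutch readdb crawldirectory/crawldb -stats

which should print the overall totals (total URLs, fetched/unfetched
counts and so on) that I can compare against the number of documents that
actually make it into Solr.
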
On Mon, Mar 10, 2014 at 10:13 PM, feng lu <[email protected]> wrote:

> <<
> 1. I am confused when you mention that the number of cycles is determined
> by the topN parameter. I was under the impression from reading the
> documentation that topN determines the max number of links that are going
> to be included at each level, so as to not miss any link I gave an
> arbitrary value of 30,000 at each level. Is my understanding correct? If
> so, how can I reduce the delay that topN is causing?
> >>
>
> Yes, you are right, the 30,000 is set as the number of top URLs to be
> selected in each crawl segment.
>
> <<
> 2. Apparently I am doing the exact same thing as you, i.e. indexing in
> Fetcher.java by calling a utility method that takes the content and
> populates the index (Solr), and not using the indexing feature provided
> by Nutch. My understanding is that this is the Map phase of the Nutch job
> and the Reduce phase for Nutch is only relevant to compute the outlinks
> for each URL, which I don't need because for me every URL/link is equally
> important. The question is that I still see Hadoop spending a significant
> amount of time during the reduce job. If my understanding is correct, can
> I disable the reduce phase for the Nutch job, and how can I do so?
> >>
>
> I see the Fetcher class is a MapRunnable implementation, so it will not
> have a Reduce phase. Which Reduce phase of Nutch do you mean? I see that
> the Parse phase extracts all URLs in the content and calculates the score
> for each URL.
>
> <<
> 3. When a page is crawled, certain criteria are applied to determine
> whether it is eligible for indexing. In my sample crawl of four websites
> I end up with fewer documents in my index than expected. How do you
> suggest I implement a way to see how many pages were crawled before the
> filtering criteria were applied in my utility method, to know how
> restrictive my filtering criteria are (like only 10% were indexed or
> something)?
> >>
>
> You can use the bin/nutch readdb command to print overall statistics for
> the crawl database. I also see that the Indexer MapReduce phase prints
> some indexing status to the Hadoop console, such as the number of URLs
> skipped by filters, the number of documents added, and URLs with indexing
> errors.
>
>
> On Tue, Mar 11, 2014 at 3:56 AM, S.L <[email protected]> wrote:
>
> > Can anyone please answer these questions?
> >
> >
> > On Wed, Mar 5, 2014 at 7:52 PM, S.L <[email protected]> wrote:
> >
> > > Markus,
> > >
> > > Thanks! Please bear with me, I have a few more questions.
> > >
> > > 1. I am confused when you mention that the number of cycles is
> > > determined by the topN parameter. I was under the impression from
> > > reading the documentation that topN determines the max number of
> > > links that are going to be included at each level, so as to not miss
> > > any link I gave an arbitrary value of 30,000 at each level. Is my
> > > understanding correct? If so, how can I reduce the delay that topN is
> > > causing?
> > > 2. Apparently I am doing the exact same thing as you, i.e. indexing
> > > in Fetcher.java by calling a utility method that takes the content
> > > and populates the index (Solr), and not using the indexing feature
> > > provided by Nutch. My understanding is that this is the Map phase of
> > > the Nutch job and the Reduce phase for Nutch is only relevant to
> > > compute the outlinks for each URL, which I don't need because for me
> > > every URL/link is equally important. The question is that I still see
> > > Hadoop spending a significant amount of time during the reduce job.
> > > If my understanding is correct, can I disable the reduce phase for
> > > the Nutch job, and how can I do so?
> > > 3. When a page is crawled, certain criteria are applied to determine
> > > whether it is eligible for indexing. In my sample crawl of four
> > > websites I end up with fewer documents in my index than expected. How
> > > do you suggest I implement a way to see how many pages were crawled
> > > before the filtering criteria were applied in my utility method, to
> > > know how restrictive my filtering criteria are (like only 10% were
> > > indexed or something)?
> > >
> > > I would really appreciate it if you could take the time to answer my
> > > questions or provide me any leads.
> > >
> > > Thanks in advance!
> > >
> > >
> > > On Tue, Mar 4, 2014 at 9:58 AM, Markus Jelsma
> > > <[email protected]> wrote:
> > >
> > >> Yes, the console shows you what it is doing, stdout as well.
> > >> In your case it is the depth that makes it take so long; it does
> > >> 30,000 crawl cycles. We do cycles of around 1000-2000 and that takes
> > >> between 10 and 15 minutes, and we skip the indexing job (we index in
> > >> the Fetcher). In the end we do around 90-110 cycles every day, so
> > >> 30,000 would take us almost a year! :)
> > >>
> > >> If your crawler does not finish all its records before the default
> > >> or adaptive interval, it won't stop for a long time! :)
> > >>
> > >> -----Original message-----
> > >> > From: S.L <[email protected]>
> > >> > Sent: Tuesday 4th March 2014 8:09
> > >> > To: [email protected]
> > >> > Subject: When can the Nutch MapReduce job be considered complete?
> > >> >
> > >> > Hi All,
> > >> >
> > >> > I have set up a pseudo-distributed cluster using Hadoop 2.3, I am
> > >> > running Nutch 1.7 on it as a MapReduce job, and I use the
> > >> > following command to submit the job:
> > >> >
> > >> > /mnt/hadoop-2.3.0/bin/hadoop jar
> > >> > /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job
> > >> > org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000
> > >> > -topN 30000
> > >> >
> > >> > I notice that the crawl is continuing even after 72 hours. I am
> > >> > only crawling 4 websites and have disabled outlinks to external
> > >> > domains. Most of the pages are crawled in the first few hours, but
> > >> > then the crawl keeps on running and very few pages are crawled in
> > >> > those extended crawl sessions. Is my high topN value causing this
> > >> > seemingly never-ending crawl?
> > >> >
> > >> > How can I track the status (from the Hadoop console or otherwise)?
> > >> >
> > >> > Thanks.
> > >> >
> > >>
>
>
> --
> Don't Grow Old, Grow Up... :-)
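P.S. Based on Markus's numbers above, I am going to try much smaller
cycles. A rough sketch of the command I plan to submit next (same jar and
paths as my original command; the -depth and -topN values here are just
illustrative guesses, not figures recommended in this thread):

  /mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
    org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 10 -topN 2000

i.e. fewer levels per run and roughly the 1000-2000 URLs per cycle that
Markus mentioned, rerun as often as needed instead of one huge job.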

