Hi Tejas,

Thank you very much for the detailed reply.
Please find my observations inline:

> *A. The main url [1] does NOT get crawled.*
> This can happen due to some regex mismatch, or an exception while crawling
> the url.
> One naive way to forget about regex rules is to simply add "+." at the
> start of the regex rules file. It will start accepting any url.

Done.

> Now run a fresh crawl and see if the url is getting fetched or not. How to
> check? Use the "bin/nutch CrawlDbReader" command.

The command I used is:

bin/nutch readdb crawlnewtest -stats

CrawlDb statistics start: crawlnewtest
Statistics for CrawlDb: crawlnewtest
TOTAL urls: 1
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done

The status is 1 (db_unfetched). The log has only INFO and WARN entries, no errors:

WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

> If the status shown is db_unfetched, then check for logs which can provide
> some exceptions responsible behind the failure.
> If it was fetched, then go to B.

Does the above status mean that url [1] is crawled? If that is the case, why is it not indexed to Solr? The Nutch command I used is:

bin/nutch crawl urls -dir crawlnewtest -solr http://localhost:8080/solrnutch -depth 3 -topN 5

I do not see any errors in the logs other than the warnings mentioned above. I have yet to follow the last few steps of reading the segments.

> *B. The main url gets crawled successfully but the rest 2 child pages are
> not getting crawled.*
> If the url [1] is db_fetched, then use the same
> command "bin/nutch CrawlDbReader" to see the status of the rest 2 child
> pages.
> If they are both db_unfetched, then there was some exception which caused
> the issue. See logs for details.
>
> If the child pages are not found in the DB, then there was some issue with
> link extraction, i.e. getting the child links from the content of the main
> page.
> Read the segment for the first round of crawl (which will have the content
> of the main page). Extract the content of the main page using the
> "bin/nutch SegmentReader" command. Check if the content fetched has the
> child urls in it. If yes, then the issue is with link extraction.

On Mon, Dec 17, 2012 at 2:13 PM, Tejas Patil <[email protected]> wrote:

> Let's break down the possibilities:
>
> *A. The main url [1] does NOT get crawled.*
> This can happen due to some regex mismatch, or an exception while crawling
> the url.
> One naive way to forget about regex rules is to simply add "+." at the
> start of the regex rules file. It will start accepting any url.
> Now run a fresh crawl and see if the url is getting fetched or not. How to
> check? Use the "bin/nutch CrawlDbReader" command.
> If the status shown is db_unfetched, then check for logs which can provide
> some exceptions responsible behind the failure.
> If it was fetched, then go to B.
>
> *B. The main url gets crawled successfully but the rest 2 child pages are
> not getting crawled.*
> If the url [1] is db_fetched, then use the same
> command "bin/nutch CrawlDbReader" to see the status of the rest 2 child
> pages.
> If they are both db_unfetched, then there was some exception which caused
> the issue. See logs for details.
>
> If the child pages are not found in the DB, then there was some issue with
> link extraction, i.e. getting the child links from the content of the main
> page.
> Read the segment for the first round of crawl (which will have the content
> of the main page). Extract the content of the main page using the
> "bin/nutch SegmentReader" command. Check if the content fetched has the
> child urls in it. If yes, then the issue is with link extraction.
>
> Please do this and revert back to this group with your observations.
>
> Thanks,
> Tejas Patil
>
>
> [1] : http://43.44.111.123:8080/nutch-test-site/ch-1.html
>
>
> On Sun, Dec 16, 2012 at 9:48 PM, Rajani Maski <[email protected]>
> wrote:
>
> > Hi users,
> >
> > I am trying to crawl the web applications running on the local Apache
> > Tomcat webserver. Note: Tomcat version 7, running on port 8080.
> >
> >
> > The main html page is:
> > http://43.44.111.123:8080/nutch-test-site/ch-1.html
> > This main page has a hyperlink to its child page -
> > http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html
> > and that child page in turn has a hyperlink to its own child -
> > http://43.44.111.123:8080/nutch-test-site/ch2/ch2-2.html
> >
> >
> > Now *I would like to know what filter has to be given in
> > regex-urlfilter.txt to accept crawling for this site*.
> > I am getting the log message "No more URLs to fetch", so there seems to
> > be a mistake in my regex-urlfilter.txt or seed.txt.
> >
> > I tried the following setups:
> >
> > *Case 1*
> > regex-urlfilter.txt -
> > # accept anything else
> > +^http://43.44.111.123:8080/nutch-test-site/child-1.html
> >
> > seed.txt -
> > http://43.44.111.123:8080/nutch-test-site/child-1.html
> >
> >
> > *Case 2*
> > regex-urlfilter.txt -
> > # accept anything else
> > +^http://43.44.111.123:8080/
> >
> > seed.txt -
> > http://43.44.111.123:8080/nutch-test-site/child-1.html
> >
> >
> > *Case 3*
> > regex-urlfilter.txt -
> > # accept anything else
> > +^http://43.44.111.123:8080/
> >
> > seed.txt -
> > http://43.44.111.123:8080/
> >
> >
> > Output: Stopping at depth=1 - no more URLs to fetch.
> >
> >
> > *Nutch command:*
> > bin/nutch crawl urls -dir tomcatcrawl -solr http://localhost:8080/solrnutch -depth 3 -topN 5
> >
> > Can you please point out the mistake here?
> >
> > Regards
> > Rajani.
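
On the filter question at the bottom of this thread: as far as I understand it, the rules in regex-urlfilter.txt are tried from top to bottom, the first pattern found in a URL decides whether it is accepted ("+") or rejected ("-"), and a URL that matches no pattern is rejected. The snippet below is only a rough Python sketch of that behaviour for trying out patterns offline. It is not Nutch's own code: the helper names parse_rules and accepts are made up for illustration, and the rule with escaped dots is an assumption about how the filter could be written, not something confirmed in this thread.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first rule whose regex is found in the URL decides; no match rejects."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Hypothetical rule set: accept only the test site, reject everything else.
RULES = parse_rules(r"""
# accept pages under the test site
+^http://43\.44\.111\.123:8080/nutch-test-site/
# reject anything else
-.
""")

if __name__ == "__main__":
    for url in (
        "http://43.44.111.123:8080/nutch-test-site/ch-1.html",
        "http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html",
        "http://43.44.111.123:8080/nutch-test-site/ch2/ch2-2.html",
        "http://example.org/",
    ):
        print(url, "->", "accepted" if accepts(RULES, url) else "rejected")
```

An unescaped "." in a pattern matches any character, so escaping the dots only makes the rule stricter; it does not change whether the three test-site pages are accepted.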

