Hi users, I am trying to crawl a web application running on a local Apache Tomcat web server. Note: Tomcat version 7, running on port 8080.
The main HTML page is http://43.44.111.123:8080/nutch-test-site/ch-1.html. This main page has a hyperlink to its child page, http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html, and that child page in turn has a hyperlink to its own child, http://43.44.111.123:8080/nutch-test-site/ch2/ch2-2.html.

*I would like to know which filter has to be given in regex-urlfilter.txt so that this site is accepted for crawling.* I ask because the log says there are no more URLs to fetch, which suggests a mistake in my regex-urlfilter.txt or seed.txt.

I tried the following setups:

*Case 1*
regex-urlfilter.txt:
    # accept anything else
    +^http://43.44.111.123:8080/nutch-test-site/child-1.html
seed.txt:
    http://43.44.111.123:8080/nutch-test-site/child-1.html

*Case 2*
regex-urlfilter.txt:
    # accept anything else
    +^http://43.44.111.123:8080/
seed.txt:
    http://43.44.111.123:8080/nutch-test-site/child-1.html

*Case 3*
regex-urlfilter.txt:
    # accept anything else
    +^http://43.44.111.123:8080/
seed.txt:
    http://43.44.111.123:8080/

Output:
    Stopping at depth=1 - no more URLs to fetch.

*Nutch command:*
    bin/nutch crawl urls -dir tomcatcrawl -solr http://localhost:8080/solrnutch -depth 3 -topN 5

Can you please point out the mistake here?

Regards,
Rajani
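
P.S. To sanity-check the patterns outside Nutch, here is a minimal Python sketch of how I understand the rules in regex-urlfilter.txt to be evaluated. It rests on assumptions: rules are tried top to bottom, the first pattern found anywhere in the URL decides ("+" accepts, "-" rejects), and a URL matching no rule is rejected. The function name `accepts` and the list names are made up for illustration, only the "+" rules quoted above are included (not any other rules that may be in the file), and Python's `re` stands in for the Java regex engine, which should behave the same for these simple patterns.

```python
import re

def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule (assumed semantics)."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Unanchored search: the pattern may match anywhere unless it uses ^ or $.
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is ignored.
    return False

# Only the rules quoted in this post; a real regex-urlfilter.txt may hold more.
CASE_1 = ["+^http://43.44.111.123:8080/nutch-test-site/child-1.html"]
CASE_2 = ["+^http://43.44.111.123:8080/"]

URLS = [
    "http://43.44.111.123:8080/nutch-test-site/ch-1.html",
    "http://43.44.111.123:8080/nutch-test-site/child-1.html",
    "http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html",
    "http://43.44.111.123:8080/nutch-test-site/ch2/ch2-2.html",
]

if __name__ == "__main__":
    for name, rules in (("Case 1", CASE_1), ("Case 2", CASE_2)):
        for url in URLS:
            print(name, "accept" if accepts(rules, url) else "reject", url)
```

If these assumptions hold, the Case 1 rule would accept only the seed URL itself, so no linked page could pass the filter, while the Case 2 and Case 3 rule would accept every URL on the host.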

