Hi users,

I am trying to crawl a web application running on a local Apache Tomcat
web server. Note: Tomcat version 7, running on port 8080.


The main HTML page is: http://43.44.111.123:8080/nutch-test-site/ch-1.html
This main page has a hyperlink to its child page -
http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html
and that child page in turn has a hyperlink to its own child -
http://43.44.111.123:8080/nutch-test-site/ch2/ch2-2.html


Now *I would like to know which filter has to be given in
regex-urlfilter.txt so that this site gets crawled*,
because the log says there are no more URLs to fetch. This seems to be a
mistake in my regex-urlfilter.txt or seed.txt.

I tried the following setups:

*Case 1*
regex-urlfilter.txt  -
   # accept anything else
   +^http://43.44.111.123:8080/nutch-test-site/child-1.html

seed.txt -
  http://43.44.111.123:8080/nutch-test-site/child-1.html


*Case 2*
regex-urlfilter.txt  -
   # accept anything else
   +^http://43.44.111.123:8080/

seed.txt -
  http://43.44.111.123:8080/nutch-test-site/child-1.html


*Case 3*
regex-urlfilter.txt  -
   # accept anything else
   +^http://43.44.111.123:8080/

seed.txt -
  http://43.44.111.123:8080/


Output : Stopping at depth=1 - no more URLs to fetch.
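
To make the setups above concrete, here is a minimal sketch of how I understand the rules in regex-urlfilter.txt to be applied. This is not Nutch code. It assumes that rules are checked top to bottom, that the first pattern found anywhere in the URL decides (+ accepts, - rejects), and that a URL matching no rule is rejected. The names parse_rules, accepts, CASE_1 and CASE_2 are made up for illustration, and Python's re module stands in for Java regex.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is dropped

# Filter contents from Case 1 and Case 2 above.
CASE_1 = parse_rules("""
# accept anything else
+^http://43.44.111.123:8080/nutch-test-site/child-1.html
""")
CASE_2 = parse_rules("""
# accept anything else
+^http://43.44.111.123:8080/
""")

SEED = "http://43.44.111.123:8080/nutch-test-site/child-1.html"
CHILD = "http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html"

if __name__ == "__main__":
    for name, rules in (("Case 1", CASE_1), ("Case 2", CASE_2)):
        print(name, "seed:", accepts(rules, SEED), "child:", accepts(rules, CHILD))
```

Under these assumptions, the Case 1 filter lets only the seed URL through and would drop the linked child page, while the Case 2 filter lets both through, so the filter alone does not seem to explain the result for Case 2 and Case 3.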


*Nutch command:*
bin/nutch crawl urls -dir tomcatcrawl -solr http://localhost:8080/solrnutch -depth 3 -topN 5

Can you please point out the mistake here?

Regards
Rajani.
