Hi guys, I'm trying to crawl a local intranet site (currently just a test) over HTTP with the binary version of Nutch 1.8. The site is accessible at http://localhost/ and my documents are in a directory called pcedocs, i.e. they are accessible at http://localhost/pcedocs/.
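For what it's worth, both HTML files seem fine on the web server side; a direct request to each of them, along the lines of the curl calls below (shown purely to illustrate that check, nothing Nutch-specific), doesn't show any problem:

    curl -I http://localhost/pcedocs/index.html
    curl -I http://localhost/pcedocs/index1.html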
The documents currently in the folder are as follows:

    -rw-r--r--. 1 root root   884 May  4 16:41 index1.html
    -rw-r--r--. 1 root root   882 May  4 16:00 index.html
    -rw-r--r--. 1 root root  2072 May  4 16:01 light_button.png
    -rw-r--r--. 1 root root 35431 May  4 16:01 light_logo.png
    -rw-r--r--. 1 root root   103 May  4 16:01 poweredby.png

so I'm expecting the two HTML docs to be picked up by the crawl.

My urls/seed.txt is as follows:

    http://localhost/pcedocs/

My regex-urlfilter.txt is unchanged from the original except for the following lines:

    # accept anything else
    +.*/pcedocs/.*

I have also tried replacing "+.*/pcedocs/.*" with:

    +^http://([a-z0-9]*\.)*localhost/

In both instances the crawl gives the following output:

    Injector: starting at 2014-04-20 16:55:32
    Injector: crawlDb: crawl/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Injector: total number of urls rejected by filters: 0
    Injector: total number of urls injected after normalization and filtering: 1
    Injector: Merging injected urls into crawl db.
    Injector: overwrite: false
    Injector: update: false
    Injector: finished at 2014-04-20 16:55:35, elapsed: 00:00:02
    Sun 20 Apr 16:55:35 EST 2014 : Iteration 1 of 2
    Generating a new segment
    Generator: starting at 2014-04-20 16:55:36
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: false
    Generator: normalizing: true
    Generator: topN: 50000
    Generator: Partitioning selected urls for politeness.
    Generator: segment: crawl/segments/20140420165538
    Generator: finished at 2014-04-20 16:55:40, elapsed: 00:00:03
    Operating on segment : 20140420165538
    Fetching : 20140420165538
    Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
    Fetcher: starting at 2014-04-20 16:55:40
    Fetcher: segment: crawl/segments/20140420165538
    Fetcher Timelimit set for : 1398041740834
    Using queue mode : byHost
    Fetcher: threads: 50
    Fetcher: time-out divisor: 2
    QueueFeeder finished: total 1 records + hit by time limit :0
    Using queue mode : byHost
    Using queue mode : byHost
    fetching http://localhost/pcedocs/ (queue crawl delay=5000ms)
    . . .

Checking the Solr index shows that only one document has been indexed (index.html) and that its URL is http://localhost/pcedocs/.

What I'm expecting is for the crawl to produce two valid URLs:

    http://localhost/pcedocs/index.html
    http://localhost/pcedocs/index1.html

(and, as more documents are added, more URLs).

My question is: how do I get Nutch to crawl all the files on a website, not just the "root" URL? Is it a problem with my URL filter or some other config? I'm missing something basic here but can't for the life of me figure out what.

Any help would be much appreciated.

Cheers,
P
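P.S. For completeness, the crawl itself is being run with the stock bin/crawl script that ships with Nutch 1.8, roughly as sketched below; the Solr URL shown is just my local default and may differ, and the final argument is the number of rounds, which matches the "Iteration 1 of 2" line in the log above:

    # bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
    # Solr URL below is assumed (local default); adjust to your setup
    bin/crawl urls crawl http://localhost:8983/solr/ 2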

