Hi Guys

I am trying to crawl a local intranet site (currently just a test) over HTTP
with the binary distribution of Nutch 1.8.  The site is accessible at
http://localhost/ and my documents are in a directory called pcedocs, i.e.
they are accessible at http://localhost/pcedocs/.

The documents currently in the folder are as follows:

-rw-r--r--. 1 root root    884 May  4 16:41 index1.html
-rw-r--r--. 1 root root    882 May  4 16:00 index.html
-rw-r--r--. 1 root root   2072 May  4 16:01 light_button.png
-rw-r--r--. 1 root root  35431 May  4 16:01 light_logo.png
-rw-r--r--. 1 root root    103 May  4 16:01 poweredby.png

so I'm expecting the two HTML docs to be picked up by the crawl.
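
As a quick sanity check (assuming curl is available), both files can be
fetched directly to confirm the web server is actually serving them:

curl -I http://localhost/pcedocs/index.html
curl -I http://localhost/pcedocs/index1.html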

My urls/seed.txt is as follows:

http://localhost/pcedocs/
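
and I'm launching the crawl with the bundled crawl script, along these lines
(the Solr URL and the two rounds are just my local test setup):

bin/crawl urls crawl http://localhost:8983/solr/ 2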

My regex-urlfilter.txt is unchanged from the original except for the
following lines:

# accept anything else
+.*/pcedocs/.*

I have also tried replacing "+.*/pcedocs/.*" with:

+^http://([a-z0-9]*\.)*localhost/
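
In case it's useful, this is how I've been checking which URLs the filters
accept (URLFilterChecker reads URLs on stdin and prints each back prefixed
with + for accepted or - for rejected; I believe this invocation is right
for 1.8):

echo "http://localhost/pcedocs/index1.html" | \
  bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined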

In both instances the crawl gives the following output:

Injector: starting at 2014-04-20 16:55:32
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: finished at 2014-04-20 16:55:35, elapsed: 00:00:02
Sun 20 Apr 16:55:35 EST 2014 : Iteration 1 of 2
Generating a new segment
Generator: starting at 2014-04-20 16:55:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20140420165538
Generator: finished at 2014-04-20 16:55:40, elapsed: 00:00:03
Operating on segment : 20140420165538
Fetching : 20140420165538
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2014-04-20 16:55:40
Fetcher: segment: crawl/segments/20140420165538
Fetcher Timelimit set for : 1398041740834
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://localhost/pcedocs/ (queue crawl delay=5000ms)
.
.
.

Checking the Solr index shows that only one document has been indexed
(index.html) and that its URL is http://localhost/pcedocs/.
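
For reference, I'm checking the index with a plain query, something like
this (the Solr URL is just my local setup):

curl "http://localhost:8983/solr/select?q=*:*&fl=url&wt=json"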

What I'm expecting is for the crawl to produce two valid URLs:

http://localhost/pcedocs/index.html
http://localhost/pcedocs/index1.html

(and more URLs as more documents are added).

My question is: how do I get Nutch to crawl all the files on a website, not
just the "root" URL?

Is it a problem with my URL filter, or with some other config?
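
In case it helps with diagnosis, the crawldb can also be dumped to list
every URL Nutch has discovered so far (readdb options as documented for
Nutch 1.x, as far as I know):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb-dump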

I'm missing something basic here but can't for the life of me figure out
what.

Any help would be much appreciated.

Cheers

P
