Hi!
I am trying to crawl a forum and I'm seeing some strange behavior.
I have defined the following URL filters:
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-...@#]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^http://<forum_domain>(.)*
-.
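As a sanity check, the filter list above can be replayed offline. This is only a minimal sketch: it assumes the rules are applied top to bottom, that the first pattern found in the URL decides ("+" accepts, "-" rejects), and that a URL matching no rule is rejected. The suffix rule is assumed to read `-\.(gif|...)$`, the garbled rule ("-...@#]") is left out because its real content is unknown, and `forum.example.org` is a placeholder for <forum_domain>. Python's `re` stands in for the Java regex engine here, which should be equivalent for these particular patterns.

```python
import re

# Rules in file order. "forum.example.org" is a placeholder domain, and the
# garbled rule from the post is deliberately omitted.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://forum.example.org(.)*"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


if __name__ == "__main__":
    # Hypothetical URLs: a topic page, the bare domain, an image, another host.
    for url in (
        "http://forum.example.org/viewtopic=123",
        "http://forum.example.org",
        "http://forum.example.org/logo.png",
        "http://other.example.com/",
    ):
        print(url, "->", "accepted" if accepts(url) else "rejected")
```

If URLs on the forum domain come out as accepted here, the rules shown are probably not what keeps pages out of the crawldb, although the omitted rule could still reject URLs containing characters such as "=".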
If I define seeds such as the following (in the "*urls.txt*" file):
http://<forum_domain>/viewtopic=<topic_id>
http://<forum_domain>/viewforum=<forum_id> or
http://<forum_domain>
at the end of the crawling process the CrawlDb has just one URL, the seed, in an
"unfetched" status.
Everything in the logs looks OK for these three seed cases, with no error
traces, but if I check the db status
("*nutch readdb <crawldb_cluster_location> -stats*") at the end of the
execution, I get something similar to:
TOTAL urls: 1
retry 2: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
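The statistics above can be turned into a small dictionary to make the interesting entries stand out. The sketch below is an illustration only: `parse_crawldb_stats` is a hypothetical helper, not part of Nutch, and it simply collects every "name: number" line. As far as I understand, the "retry N" lines count entries by their retry counter, so a non-zero retry count next to db_unfetched would suggest that a fetch was attempted and failed rather than never attempted.

```python
import re


def parse_crawldb_stats(text):
    """Collect 'name: number' lines from readdb -stats style output."""
    stats = {}
    for line in text.splitlines():
        # Leading non-word characters (spaces, stray asterisks) are skipped.
        m = re.match(r"^\W*(.+?):\s*(\d+(?:\.\d+)?)\s*$", line)
        if m:
            key, value = m.group(1).strip(), m.group(2)
            stats[key] = float(value) if "." in value else int(value)
    return stats


# Sample copied from the output shown in this post.
SAMPLE = """\
TOTAL urls: 1
retry 2: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
"""

stats = parse_crawldb_stats(SAMPLE)
# Entries whose retry counter is above zero.
retried = {k: v for k, v in stats.items()
           if k.startswith("retry ") and k != "retry 0"}

if __name__ == "__main__":
    print(stats)
    print("retried entries:", retried)
```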
*Crawl.log* content:
Injector: starting
Injector: crawlDb: <cluster_location>/crawl/crawldata/crawl/crawldb
Injector: urlDir: <cluster_location>/crawl/crawlurls/urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: <cluster_location>/crawl/crawldata/crawl/segments/20101018171814
Generator: filtering: true
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: <cluster_location>/crawl/crawldata/crawl/segments/20101018171814
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: <cluster_location>/crawl/crawldata/crawl/crawldb
CrawlDb update: segments: [<cluster_location>/crawl/crawldata/crawl/segments/20101018171814]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done
The *nutch_site.xml* file has the proxy host and port defined correctly (the
"*http.proxy.host*" and "*http.proxy.port*" properties).
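Since the fetch goes through a proxy, one way to rule out the network path is to request a seed through the same proxy outside Nutch. The sketch below is only an assumption about how such a check could look: `build_proxy_opener` and `check_fetch` are hypothetical helpers, and `proxy.example`, port 8080 and the seed URL are placeholders for the real values. It uses Python's standard `urllib.request` and does not reproduce Nutch's own HTTP settings (user agent, robots.txt handling).

```python
import urllib.error
import urllib.request


def build_proxy_opener(proxy_host, proxy_port):
    """Build an opener that sends plain-HTTP requests through the given proxy."""
    proxy = f"http://{proxy_host}:{proxy_port}"
    handler = urllib.request.ProxyHandler({"http": proxy})
    return urllib.request.build_opener(handler)


def check_fetch(url, proxy_host, proxy_port, timeout=10):
    """Try one GET through the proxy and return a short status string."""
    opener = build_proxy_opener(proxy_host, proxy_port)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server (or the proxy) answered, but with an error status.
        return f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # No usable answer at all: DNS, connection or timeout problem.
        return f"failed: {err}"


if __name__ == "__main__":
    # Placeholders: substitute the real proxy settings and a real seed URL.
    print(check_fetch("http://forum.example.org/", "proxy.example", 8080))
```

A "failed" result, or an HTTP error status, would point at the proxy or the forum server rather than at the URL filters.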
Can somebody lend me a hand, please?
Thanks in advance.