Hi!

I am trying to crawl a forum, and I'm getting a strange behavior:

I have defined the following URL filters:

-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-...@#]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^http://<forum_domain>(.)*
-.
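To illustrate what these filters do: Nutch's urlfilter-regex plugin evaluates the rules top-down, and the *first* matching rule decides (`+` accept, `-` reject; no match at all also rejects). Below is a small sketch of that logic in Python, using `forum.example.com` as a stand-in for `<forum_domain>` (an assumption, not from the post); the truncated character-class rule from the post is omitted here.

```python
import re

# Rules in the same order as the filter file; first match wins.
# "forum.example.com" is a placeholder for <forum_domain>.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt"
          r"|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://forum\.example\.com(.)*"),
    ("-", r"."),
]

def accepts(url: str) -> bool:
    """Return True if the first rule whose pattern matches url is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched -> rejected

print(accepts("http://forum.example.com/viewtopic=42"))  # accepted by the '+' rule
print(accepts("http://forum.example.com/logo.gif"))      # rejected by the suffix rule
```

With these rules a seed like `http://forum.example.com/viewtopic=42` passes, so the filters themselves should not be what stops the seed from being fetched.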


And, if I define a seed like one of the following (in the *urls.txt* file):

       http://<forum_domain>/viewtopic=<topic_id>
       http://<forum_domain>/viewforum=<forum_id>
       http://<forum_domain>

at the end of the crawling process the CrawlDb contains just one URL, the
seed, in an "unfetched" status.


Everything in the logs looks OK for these three (seed) cases, with no error
traces, but if I check the db status
("*nutch readdb <crawldb_cluster_location> -stats*") at the end of the
execution, I get something similar to:

        TOTAL urls:     1
        retry 2:        1
        min score:      1.0
        avg score:      1.0
        max score:      1.0
        status 1 (db_unfetched): 1
        CrawlDb statistics: done

*Crawl.log* content:

       Injector: starting
        Injector: crawlDb: <cluster_location>/crawl/crawldata/crawl/crawldb
        Injector: urlDir: <cluster_location>/crawl/crawlurls/urls.txt
        Injector: Converting injected urls to crawl db entries.
        Injector: Merging injected urls into crawl db.
        Injector: done

        Generator: Selecting best-scoring urls due for fetch.
        Generator: starting
        Generator: segment:
<cluster_location>/crawl/crawldata/crawl/segments/20101018171814
        Generator: filtering: true
        Generator: Partitioning selected urls by host, for politeness.
        Generator: done.

        Fetcher: starting
        Fetcher: segment:
<cluster_location>/crawl/crawldata/crawl/segments/20101018171814
        Fetcher: done

        CrawlDb update: starting
        CrawlDb update: db: <cluster_location>/crawl/crawldata/crawl/crawldb
        CrawlDb update: segments:
[<cluster_location>/crawl/crawldata/crawl/segments/20101018171814]
        CrawlDb update: additions allowed: true
        CrawlDb update: URL normalizing: false
        CrawlDb update: URL filtering: false
        CrawlDb update: Merging segment data into db.
        CrawlDb update: done



The *nutch-site.xml* file has the proxy host and port defined correctly (the
"*http.proxy.host*" and "*http.proxy.port*" properties).

Can somebody lend me a hand, please?
Thanks in advance.
