Hi,
I am using Nutch to crawl the website http://www.sina.com.cn.
The crawl stops at depth 0 and fetches only the homepage of the
site.

My crawl-urlfilter.txt is as follows:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)sina.com.cn/

# skip everything else
-.
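To see which links survive these rules, it can help to replay them outside Nutch. Below is a minimal Python sketch that approximates the first-match-wins semantics of Nutch's regex URL filter (the rules are copied from the file above; the test URLs are made-up examples, not actual sina.com.cn links):

```python
import re

# Rules from crawl-urlfilter.txt, in order.
# Like Nutch's regex filter, the first rule whose pattern is found
# in the URL decides: "+" accepts, "-" rejects.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt"
          r"|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://([a-z0-9]*\.)sina.com.cn/"),
    ("-", r"."),
]

def accepted(url):
    """Return True if the first matching rule accepts the URL."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched

# Hypothetical outlink URLs, for illustration only.
for url in ["http://www.sina.com.cn/",
            "http://news.sina.com.cn/china/index.html",
            "http://news.sina.com.cn/article.shtml?id=1"]:
    print(url, "->", "accepted" if accepted(url) else "rejected")
```

One thing this makes visible: the `-[?*!@=]` rule rejects every outlink that carries a query string, so if most links on the homepage contain `?`, very little will be left to fetch at depth 1. That may or may not be the cause here; it is just one effect of the rules worth checking.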

Does anybody have an idea?

-- 

Yizhong Zhuang
Beijing University of Posts and Telecommunications
Email:[email protected]
Myblog:www.mikkoo.info
