Hi, I am using Nutch to crawl a website: http://www.sina.com.cn. The crawl process stops at depth 0 and only fetches the homepage of the site.
My crawl-urlfilter.txt is as follows:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)sina.com.cn/

# skip everything else
-.

Does anybody have an idea?

--
Yizhong Zhuang
Beijing University of Posts and Telecommunications
Email: [email protected]
My blog: www.mikkoo.info
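One way to narrow a problem like this down is to replay candidate outlink URLs against the filter rules offline. Below is a minimal sketch in Python (Nutch itself evaluates these patterns with Java regexes, so edge-case behavior may differ slightly); the rules and their order are taken from the crawl-urlfilter.txt above, with first-match-wins semantics as Nutch uses. Note that as written, the accept pattern requires a subdomain label before sina.com.cn, and the -[?*!@=] rule rejects any URL with a query string, which may filter out many outlinks from the homepage.

```python
import re

# Filter rules from the crawl-urlfilter.txt above, in order.
# The first rule whose pattern is found in the URL decides:
# '+' accepts the URL, '-' rejects it.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt"
          r"|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://([a-z0-9]*\.)sina.com.cn/"),
    ("-", r"."),
]

def accepted(url):
    """Return True if the first matching rule accepts the URL."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False

print(accepted("http://www.sina.com.cn/"))                # subdomain present: accepted
print(accepted("http://sina.com.cn/"))                    # bare host: rejected by accept rule
print(accepted("http://news.sina.com.cn/c/1.html?id=2"))  # rejected by the '?' rule
```

If many real outlinks come back rejected here, loosening the accept rule (for example allowing zero or more subdomain labels) or relaxing the query-character rule would be candidate fixes to try.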

