Your page takes almost 15 seconds to load (at least from here in Germany). As far as I know, Nutch only waits 1000 ms by default; see "http.max.delays" (and possibly "http.timeout").
I would also check the size of the pages against Nutch's "file.content.limit" and "http.content.limit"; as far as I know, a page larger than the limit is truncated, so links beyond the cut-off are never extracted. The site also uses several subdomains, and I don't know whether your regex covers them.

On 17.05.2011 04:20, "黄淑明" <[email protected]> wrote:
> also check out the robots.txt in sina.com.cn, maybe your agent is not
> allowed by sina.
>
>
> 2011/5/17 Bupo Jung <[email protected]>:
>> I tried your suggestion, but got the same result as before.
>>
>> 2011/5/15 ts egge <[email protected]>
>>
>>> I think your regex doesn't allow more than the home page.
>>>
>>> Try to extend your domain by .*
>>> +^http://([a-z0-9]*\.)sina.com.cn/.*
>>>
>>> On 15.05.2011 11:05, "Bupo Jung" <[email protected]> wrote:
>>> > Hi,
>>> > I use nutch to crawl a website: http://www.sina.com.cn
>>> > The crawl process stops at depth 0 and only fetches the homepage of the
>>> > website.
>>> >
>>> > My crawl-urlfilter.txt is
>>> > # accept hosts in MY.DOMAIN.NAME
>>> > +^http://([a-z0-9]*\.)sina.com.cn/
>>> >
>>> > # skip everything else
>>> > -.
>>> >
>>> > Does somebody have an idea?
>>> >
>>> > --
>>> >
>>> > Yizhong Zhuang
>>> > Beijing University of Posts and Telecommunications
>>> > Email:[email protected]
>>> > Myblog:www.mikkoo.info
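
To see which hosts the filter from the quoted message lets through, here is a rough Python sketch. It assumes that Nutch applies the rules top to bottom, that the first rule whose pattern is found anywhere in the URL decides, and that Python's `re.search` behaves closely enough to the Java regex engine for these simple patterns. The test URLs are made-up examples, and the `RELAXED` variant (with `*` after the group, as I recall it from the template shipped with Nutch) is my own suggestion, not something from the thread.

```python
import re

# Rules copied from the quoted crawl-urlfilter.txt as (sign, pattern) pairs.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)sina.com.cn/"),
    ("-", r"."),
]

# Hypothetical variant: '*' after the group allows zero or more subdomain
# levels, and the dots in the domain are escaped.
RELAXED = [
    ("+", r"^http://([a-z0-9]*\.)*sina\.com\.cn/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped


# Made-up example URLs, for illustration only.
URLS = [
    "http://www.sina.com.cn/",
    "http://news.sina.com.cn/china/",
    "http://sina.com.cn/",
    "http://a.b.sina.com.cn/",
]

for url in URLS:
    print(url, "original:", accepts(url), "relaxed:", accepts(url, RELAXED))
```

If this assumption about rule handling is right, the original rule already accepts hosts with exactly one subdomain label and only misses the bare domain and deeper subdomains, so the filter alone would not explain a crawl that stops at the homepage.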

