Also check the robots.txt on sina.com.cn; your crawler's agent name may be disallowed by Sina.
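A quick way to test this offline is to feed the robots.txt text to Python's standard urllib.robotparser. This is only a sketch: the robots.txt content and the agent name "MyNutchAgent" below are made up for illustration, not taken from sina.com.cn or from your Nutch configuration. Paste in the real file and the agent name you have configured.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` according to `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content, NOT the real file served by sina.com.cn.
SAMPLE_ROBOTS = """\
User-agent: MyNutchAgent
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for agent in ("MyNutchAgent", "OtherBot"):
        allowed = is_allowed(SAMPLE_ROBOTS, agent, "http://www.sina.com.cn/news/")
        print(agent, "allowed" if allowed else "blocked")
```

If the agent turns out to be blocked for every path, the crawl would stop after the seed page no matter what the URL filter says.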
2011/5/17 Bupo Jung <[email protected]>:
> I tried your suggestion, but got the same result as before.
>
> 2011/5/15 ts egge <[email protected]>
>
>> I think your regex doesn't allow more than the home page.
>>
>> Try extending your domain pattern with .*
>> +^http://([a-z0-9]*\.)sina.com.cn/.*
>>
>> On 15.05.2011 11:05, "Bupo Jung" <[email protected]> wrote:
>> > Hi,
>> > I use Nutch to crawl a website: http://www.sina.com.cn
>> > The crawl process stops at depth 0 and only fetches the homepage of the
>> > website.
>> >
>> > My crawl-urlfilter.txt is
>> > # accept hosts in MY.DOMAIN.NAME
>> > +^http://([a-z0-9]*\.)sina.com.cn/
>> >
>> > # skip everything else
>> > -.
>> >
>> > Does anybody have an idea?
>> >
>> > --
>> >
>> > Yizhong Zhuang
>> > Beijing University of Posts and Telecommunications
>> > Email:[email protected]
>> > Myblog:www.mikkoo.info
>> >
>
> --
>
> Yizhong Zhuang
> Beijing University of Posts and Telecommunications
> Email:[email protected]
> Myblog:www.mikkoo.info
>
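On the regex in the quoted thread: the two filter patterns can be compared directly. The sketch below uses Python's re.search as a stand-in for the filter, on the assumption that the regex URL filter accepts a URL when the pattern is found anywhere in it (find semantics) rather than requiring a full match. The test URLs are examples chosen here, not URLs from the actual crawl.

```python
import re

# The two patterns from the thread, without the leading "+" (the accept marker).
ORIGINAL = r"^http://([a-z0-9]*\.)sina.com.cn/"
EXTENDED = r"^http://([a-z0-9]*\.)sina.com.cn/.*"


def accepts(pattern: str, url: str) -> bool:
    """Assumed filter behaviour: accept if the pattern is found in the URL."""
    return re.search(pattern, url) is not None


# Illustrative URLs only.
URLS = [
    "http://www.sina.com.cn/",
    "http://news.sina.com.cn/china/index.html",
    "http://sina.com.cn/",
    "http://a.b.sina.com.cn/",
    "http://www.example.com/",
]

if __name__ == "__main__":
    for url in URLS:
        print(accepts(ORIGINAL, url), accepts(EXTENDED, url), url)
```

Under that assumption both patterns accept exactly the same URLs, which would explain why appending .* changed nothing. The run also shows that the group ([a-z0-9]*\.) has no trailing *, so it requires exactly one subdomain label: http://sina.com.cn/ and http://a.b.sina.com.cn/ are rejected by both patterns.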

