Also check the robots.txt on sina.com.cn; maybe your agent is not
allowed by sina.
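
A quick way to test this outside of Nutch is sketched below in Python, using the standard library's robots.txt parser. The robots.txt rules and the agent names (`BadBot`, `MyNutchAgent`) are made up for illustration; they are not sina's real rules, and your agent name is whatever you configured for your crawler.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory, no network access
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content for illustration only (NOT sina's real file).
SAMPLE_ROBOTS = [
    "User-agent: BadBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /private/",
]

if __name__ == "__main__":
    for agent in ("BadBot", "MyNutchAgent"):  # hypothetical agent names
        ok = is_allowed(SAMPLE_ROBOTS, agent, "http://www.sina.com.cn/news/")
        print(agent, "allowed" if ok else "blocked")
```

To check the live file instead, call `parser.set_url("http://www.sina.com.cn/robots.txt")` followed by `parser.read()` in place of `parser.parse(...)`, then ask `can_fetch` with your crawler's agent name.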


2011/5/17 Bupo Jung <[email protected]>:
> I tried your suggestion, but got the same result as before.
>
> 2011/5/15 ts egge <[email protected]>
>
>> I think your regex doesn't allow more than the home page.
>>
>> Try extending your domain pattern with .*
>> +^http://([a-z0-9]*\.)sina.com.cn/.*
>>
>> On 15.05.2011 at 11:05, "Bupo Jung" <[email protected]> wrote:
>> > Hi,
>> > I use Nutch to crawl a website: http://www.sina.com.cn
>> > The crawl process stops at depth 0 and only fetches the homepage of the
>> > website.
>> >
>> > My crawl-urlfilter.txt is:
>> > # accept hosts in MY.DOMAIN.NAME
>> > +^http://([a-z0-9]*\.)sina.com.cn/
>> >
>> > # skip everything else
>> > -.
>> >
>> > Does anybody have an idea?
>> >
>> > --
>> >
>> > Yizhong Zhuang
>> > Beijing University of Posts and Telecommunications
>> > Email:[email protected]
>> > Myblog:www.mikkoo.info
>>
>
>
>
> --
>
> Yizhong Zhuang
> Beijing University of Posts and Telecommunications
> Email:[email protected]
> Myblog:www.mikkoo.info
>
