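Your page takes almost 15 seconds to load (at least from here in Germany). As
far as I know, Nutch's default network timeout ("http.timeout") is 10000 ms.
See also "http.max.delays", which limits how often a fetcher thread waits for
a busy host before giving up on a page.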
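
To check the load time from your side, a rough sketch like the one below could
help. It only times a full download with Python's standard library. The name
fetch_seconds and the injectable opener parameter are my own choices for
illustration and have nothing to do with Nutch itself. Compare the result with
whatever timeout your fetcher is configured with.

```python
import time
import urllib.request


def fetch_seconds(url, timeout=30.0, opener=urllib.request.urlopen):
    """Download url completely and return (elapsed_seconds, body_size_in_bytes).

    `opener` is a hypothetical hook so the function can be exercised without
    network access; by default it is urllib.request.urlopen.
    """
    start = time.monotonic()
    with opener(url, timeout=timeout) as response:
        body = response.read()
    return time.monotonic() - start, len(body)


if __name__ == "__main__":
    # The start URL from this thread; needs network access to run.
    elapsed, size = fetch_seconds("http://www.sina.com.cn/")
    print(f"{elapsed:.1f} s, {size} bytes")
```

The byte count it prints is also useful for the content limit question.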

I would also check the size of the pages against Nutch's "file.content.limit"
and "http.content.limit".
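
Why the size matters: if the downloaded content is cut off at the limit, any
links after the cut never reach the parser. Here is a small sketch of that
effect, assuming simple byte truncation. The class LinkCollector, the function
links_within_limit, and the sample page are made up for illustration; the
limit of 50 bytes is only for the demo and is not a Nutch default.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags (illustrative helper, not Nutch code)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_within_limit(html: bytes, limit: int):
    """Return the links visible when only the first `limit` bytes are kept.

    A negative limit means "no truncation" in this sketch.
    """
    kept = html if limit < 0 else html[:limit]
    parser = LinkCollector()
    parser.feed(kept.decode("utf-8", errors="ignore"))
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Made-up page: one link near the top, one after 100 bytes of padding.
    page = b'<a href="/first">1</a>' + b"x" * 100 + b'<a href="/second">2</a>'
    print(links_within_limit(page, 50))   # truncated inside the padding
    print(links_within_limit(page, -1))   # whole page
```

If the homepage is larger than the configured limit, raising the limit (or
disabling it) would be the first thing I would try.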

I also see links to several subdomains on the page and don't know whether
your regex covers them.
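
One way to find out is to run the pattern against a few URLs by hand. The
sketch below does that in Python, assuming the filter applies the pattern as
an unanchored search (that is my understanding of Nutch's regex URL filter;
please verify it for your version). The sample URLs are hypothetical, and
BROADER is only a suggestion of mine, not something from your config.

```python
import re

# Pattern from the crawl-urlfilter.txt quoted in this thread, without the "+".
CURRENT = re.compile(r"^http://([a-z0-9]*\.)sina.com.cn/")

# Suggested alternative (my assumption): any number of subdomain labels,
# hyphens allowed, literal dots escaped.
BROADER = re.compile(r"^http://([a-z0-9-]+\.)*sina\.com\.cn/")


def accepted(url: str, pattern=CURRENT) -> bool:
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Hypothetical URLs, chosen only to exercise different host shapes.
SAMPLES = [
    "http://www.sina.com.cn/",
    "http://news.sina.com.cn/index.html",
    "http://sina.com.cn/",
    "http://a.b.sina.com.cn/",
    "http://my-sub.sina.com.cn/",
]

if __name__ == "__main__":
    for url in SAMPLES:
        current = "accept" if accepted(url) else "reject"
        broader = "accept" if accepted(url, BROADER) else "reject"
        print(f"current={current} broader={broader} {url}")
```

Note that with a search-style match a trailing .* changes nothing, so the
subdomain part of the pattern is the place to look.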

On 17.05.2011 04:20, "黄淑明" <[email protected]> wrote:
> Also check the robots.txt on sina.com.cn; maybe your agent is not
> allowed by sina.
>
>
> 2011/5/17 Bupo Jung <[email protected]>:
>> I tried your suggestion, but got the same result as before.
>>
>> 2011/5/15 ts egge <[email protected]>
>>
>>> I think your regex doesn't allow more than the home page.
>>>
>>> Try extending your domain pattern with .*
>>> +^http://([a-z0-9]*\.)sina.com.cn/.*
>>>
>>> On 15.05.2011 11:05, "Bupo Jung" <[email protected]> wrote:
>>> > Hi,
>>> > I use Nutch to crawl a website: http://www.sina.com.cn
>>> > The crawl process stops at depth 0 and only fetches the homepage of the
>>> > website.
>>> >
>>> > My crawl-urlfilter.txt is:
>>> > # accept hosts in MY.DOMAIN.NAME
>>> > +^http://([a-z0-9]*\.)sina.com.cn/
>>> >
>>> > # skip everything else
>>> > -.
>>> >
>>> > Does anybody have an idea?
>>> >
>>> > --
>>> >
>>> > Yizhong Zhuang
>>> > Beijing University of Posts and Telecommunications
>>> > Email:[email protected]
>>> > Myblog:www.mikkoo.info
>>>
>>
