Hello Sazedul

Thank you for your hint; indeed, I was hoping it would work as you said.
I am using the URL http://amwmg.com/ for my tests; it is quite a long page.

Unfortunately, even after changing the value of http.content.limit to -1 in
nutch-site.xml, truncation still occurs.
The same happened with a value of 5000000 …
(So it seems I have to download the URL contents myself … )
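For reference, the override in my nutch-site.xml looks roughly like this - the
standard Hadoop-style configuration layout, with only the value changed for the
5000000 test:

<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- -1 disables the length limit; any value >= 0 truncates longer content -->
    <value>-1</value>
  </property>
</configuration>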

Thanks a lot anyway!
Fabio


> On 16 Apr 2017, at 15:50, Sazedul Islam <[email protected]> wrote:
> 
> Yes, there is a way to download webpages without truncating them. Just set
> http.content.limit to -1 in the nutch-site.xml file.
> 
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content, in bytes. If
>   this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
> 
> 
> On Sun, Apr 16, 2017 at 7:34 PM Fabio Ricci <[email protected]>
> wrote:
> 
>> Hi
>> 
>> Is there somebody here? ;) I don't expect anyone on Easter …
>> 
>> Nutch 1.13 stores incomplete websites in the dump.
>> 
>> Is there a way to instruct it to download the entire content of a page,
>> from <html> to </html>?
>> 
>> Thank you very much in advance
>> 
>> Regards
>> Fabio
