Hi Tony,
You may need to add the http.content.limit parameter to your nutch-site.xml file.
For size-unlimited crawling:
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
Please refer to: http://wiki.apache.org/nutch/nutch-default.xml
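One more thing worth checking, independent of the content limit: in the Nutch 2.x API, webPage.getContent() returns a java.nio.ByteBuffer, and calling new String(buffer.array()) decodes the whole backing array with the platform default charset, which can pick up unused trailing bytes and mis-decode multi-byte characters. A minimal sketch of charset-aware decoding that only reads the buffer's readable region (this is a generic ByteBuffer example, not Nutch-specific code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentDecode {

    // Decode only the readable region [position, limit) of the buffer.
    // buffer.array() would also include any unused capacity beyond the
    // actual content, so it is not safe to pass to new String(...).
    static String decode(ByteBuffer buf) {
        // duplicate() so the caller's position/limit are left untouched
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        // Simulate a buffer whose capacity is larger than the content,
        // as a fetcher's reusable buffer might be.
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.put("<html>ok</html>".getBytes(StandardCharsets.UTF_8));
        buf.flip(); // limit = actual content length

        System.out.println(decode(buf)); // only the real content, no padding
    }
}
```

If the page declares a different encoding, substitute the appropriate Charset for StandardCharsets.UTF_8.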
Kind regards..
coskun...
On 06/17/2013 02:05 PM, Tony Mullins wrote:
Hi,
I am trying to crawl this URL
http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
and getting the crawled page's content in my ParseFilter plugin like this:
String html = new String(webPage.getContent().array());
Then I am using this HTML to extract my required information.
But it's not returning the complete HTML of the page. I have logged the 'html'
variable and I can see that the log file contains incomplete HTML for the above link.
Is there any size limit on a page's content? Or am I doing something wrong
here?
Thanks,
Tony.