Hi Tony,
You may need to add the http.content.limit parameter to your nutch-site.xml file.
For size-unlimited crawling:
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
Please refer to: http://wiki.apache.org/nutch/nutch-default.xml
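One more thing worth checking, independent of the content limit: in the Nutch 2.x API, webPage.getContent() returns a java.nio.ByteBuffer, and calling new String(buffer.array()) decodes the whole backing array with the platform default charset, which can pick up unused trailing bytes and mis-decode multi-byte characters. A minimal sketch of charset-aware decoding that only reads the buffer's readable region (this is a generic ByteBuffer example, not Nutch-specific code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentDecode {

    // Decode only the readable region [position, limit) of the buffer.
    // buffer.array() would also include any unused capacity beyond the
    // actual content, so it is not safe to pass to new String(...).
    static String decode(ByteBuffer buf) {
        // duplicate() so the caller's position/limit are left untouched
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        // Simulate a buffer whose capacity is larger than the content,
        // as a fetcher's reusable buffer might be.
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.put("<html>ok</html>".getBytes(StandardCharsets.UTF_8));
        buf.flip(); // limit = actual content length

        System.out.println(decode(buf)); // only the real content, no padding
    }
}
```

If the page declares a different encoding, substitute the appropriate Charset for StandardCharsets.UTF_8.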
Kind regards..
coskun...
On 06/17/2013 02:05 PM, Tony Mullins wrote:
Hi,
I am trying to crawl this URL
http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
and getting the crawled page's content in my ParseFilter plugin like this:
String html = new String(webPage.getContent().array());
Then I am using this HTML to extract my required information.
But it's not returning the complete HTML of the page. I have logged the 'html'
variable and I can see that the log file contains incomplete HTML for the above link.
Is there any size limit on a page's content? Or am I doing something wrong
here?
Thanks,
Tony.