Hello:
I've got Nutch up and running except for one big problem. It is truncating
the content of my downloaded pages at 27,000 or 28,000 bytes. Basically it
just slices the end of my web pages off and of course that completely hoses
any downstream parsing of the tags and content that I'd like to do.
This error is driving me crazy. I've done the typical things, like setting
http.content.limit to -1:
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
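For reference, this is how I read the description above. The sketch below is only an illustration of that wording, not Nutch's actual code; the function name `apply_content_limit` is made up for the example.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Mimic the documented http.content.limit behaviour (illustrative only).

    A nonnegative limit truncates longer content to `limit` bytes;
    a negative limit (such as -1) leaves the content untouched.
    """
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


if __name__ == "__main__":
    page = b"x" * 100_000  # stand-in for a downloaded page body
    print(len(apply_content_limit(page, 65536)))
    print(len(apply_content_limit(page, -1)))
```

If that reading is right, a value of -1 should rule this setting out as the cause of a cut at 27,000 or 28,000 bytes.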
I've also raised the length of the content field in my gora-sql-mapping.xml
from 65536 to 262144:
<class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
       table="webpage">
  <primarykey column="id" length="767"/>
  <field name="baseUrl" column="baseUrl" length="512"/>
  <field name="status" column="status"/>
  <field name="prevFetchTime" column="prevFetchTime"/>
  <field name="fetchTime" column="fetchTime"/>
  <field name="fetchInterval" column="fetchInterval"/>
  <field name="retriesSinceFetch" column="retriesSinceFetch"/>
  <field name="reprUrl" column="reprUrl" length="512"/>
  <!-- <field name="content" column="content" length="65536"/>-->
  <field name="content" column="content" length="262144"/>
  <field name="contentType" column="typ" length="32"/>
  <field name="protocolStatus" column="protocolStatus"/>
  <field name="modifiedTime" column="modifiedTime"/>
  <!-- parse fields -->
  <field name="title" column="title" length="512"/>
  <field name="text" column="text" length="32000"/>
  <field name="parseStatus" column="parseStatus"/>
  <field name="signature" column="signature"/>
  <field name="prevSignature" column="prevSignature"/>
  <!-- score fields -->
  <field name="score" column="score"/>
  <field name="headers" column="headers"/>
  <field name="inlinks" column="inlinks"/>
  <field name="outlinks" column="outlinks"/>
  <field name="metadata" column="metadata"/>
  <field name="markers" column="markers"/>
</class>
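To double-check that the mapping really declares the lengths I think it does, a small script along these lines could list them. This is a rough sketch: `MAPPING_SNIPPET` is a shortened copy of the mapping above, `declared_lengths` is a name invented for the example, and it only reports what the XML says, not what the database table actually uses.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <class> element from gora-sql-mapping.xml (illustrative).
MAPPING_SNIPPET = """
<class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
       table="webpage">
  <primarykey column="id" length="767"/>
  <field name="content" column="content" length="262144"/>
  <field name="text" column="text" length="32000"/>
  <field name="status" column="status"/>
</class>
"""


def declared_lengths(mapping_xml: str) -> dict:
    """Return {field name: declared length} for fields that set a length."""
    root = ET.fromstring(mapping_xml.strip())
    return {
        field.get("name"): int(field.get("length"))
        for field in root.iter("field")
        if field.get("length") is not None
    }


if __name__ == "__main__":
    for name, length in declared_lengths(MAPPING_SNIPPET).items():
        print(f"{name}: {length}")
```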
But I'm getting no love here: the content is still being cut off. Any other
ideas about what could be trimming it?
--
Ward Loving
Senior Technical Consultant
Appirio, Inc.
www.appirio.com
(706) 225-9475