Can you run the parsechecker tool, with fetcher.verbose overridden to true
and otherwise the same settings, on one of the (HTML?) documents giving you
bother? The gora-sql-0.1.1-incubating module is becoming a real pain, to be
honest.
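
The parsechecker run suggested above might look roughly like the sketch
below. The URL is a placeholder, and passing the overrides as -D options
assumes your Nutch version launches parsechecker through Hadoop's ToolRunner
and accepts -dumpText; if it doesn't, set the same properties in
nutch-site.xml instead.

```shell
# Placeholder URL (an assumption): substitute one of the pages that comes
# back truncated.
URL="http://www.example.com/truncated-page.html"

# Run from the Nutch runtime directory (for example runtime/local).
# -Dfetcher.verbose=true   turns on verbose fetcher logging
# -Dhttp.content.limit=-1  mirrors the "no truncation" setting from the
#                          original message
# -dumpText                prints the parsed text so its end can be inspected
bin/nutch parsechecker \
  -Dfetcher.verbose=true \
  -Dhttp.content.limit=-1 \
  -dumpText "$URL"
```

If the dumped text is complete here, the fetch and parse side is probably
fine, and the truncation is more likely happening when the content is written
to or read back from the SQL store.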

On Wed, Feb 6, 2013 at 6:44 PM, Ward Loving <[email protected]> wrote:

> Hello:
>
> I've got Nutch up and running except for one big problem.  It is truncating
> the content of my downloaded pages at 27,000 or 28,000 bytes.  Basically it
> just slices the end of my web pages off and of course that completely hoses
> any downstream parsing of the tags and content that I'd like to do.
>
> This error is driving me crazy.  I've done the typical things like update
> the http.content.limit:
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> And I've updated the content setting in my gora-sql-mapping.xml to a huge
> number:
>
> <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
> table="webpage">
>   <primarykey column="id" length="767"/>
>     <field name="baseUrl" column="baseUrl" length="512"/>
>     <field name="status" column="status"/>
>     <field name="prevFetchTime" column="prevFetchTime"/>
>     <field name="fetchTime" column="fetchTime"/>
>     <field name="fetchInterval" column="fetchInterval"/>
>     <field name="retriesSinceFetch" column="retriesSinceFetch"/>
>     <field name="reprUrl" column="reprUrl" length="512"/>
>     <!-- <field name="content" column="content" length="65536"/>-->
>     <field name="content" column="content" length="262144"/>
>     <field name="contentType" column="typ" length="32"/>
>     <field name="protocolStatus" column="protocolStatus"/>
>     <field name="modifiedTime" column="modifiedTime"/>
>
>     <!-- parse fields                                       -->
>     <field name="title" column="title" length="512"/>
>     <field name="text" column="text" length="32000"/>
>     <field name="parseStatus" column="parseStatus"/>
>     <field name="signature" column="signature"/>
>     <field name="prevSignature" column="prevSignature"/>
>
>     <!-- score fields                                       -->
>     <field name="score" column="score"/>
>     <field name="headers" column="headers"/>
>     <field name="inlinks" column="inlinks"/>
>     <field name="outlinks" column="outlinks"/>
>     <field name="metadata" column="metadata"/>
>     <field name="markers" column="markers"/>
> </class>
>
> But I'm getting no love here.   Any other ideas what could be trimming the
> content?
>
> --
> Ward Loving
> Senior Technical Consultant
> Appirio, Inc.
> www.appirio.com
> (706) 225-9475
>



-- 
*Lewis*
