Can you run the parsechecker tool, with fetcher.verbose overridden to true and otherwise the same settings, on one of the (HTML?) documents giving you bother? The gora-sql-0.1.1-incubating module is becoming a real pain, to be honest.
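
In case it helps, one way to set that flag is a local override in nutch-site.xml. This is only a minimal sketch: it assumes your install picks up overrides from conf/nutch-site.xml, and the description text is my own wording, not copied from nutch-default.xml.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element -->
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <!-- Description below is a paraphrase, added for illustration only -->
  <description>Debugging override: ask the fetcher to log more verbosely.</description>
</property>
```

Leave http.content.limit at -1 as you already have it, so the parsechecker run sees the same settings as your crawl.
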
On Wed, Feb 6, 2013 at 6:44 PM, Ward Loving <[email protected]> wrote:

> Hello:
>
> I've got Nutch up and running except for one big problem. It is truncating
> the content of my downloaded pages at 27,000 or 28,000 bytes. Basically it
> just slices the end of my web pages off and of course that completely hoses
> any downstream parsing of the tags and content that I'd like to do.
>
> This error is driving me crazy. I've done the typical things like update
> the http.content.limit:
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> And I've updated the content setting in my gora-sql-mapping.xml to a huge
> number:
>
> <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
> table="webpage">
>   <primarykey column="id" length="767"/>
>   <field name="baseUrl" column="baseUrl" length="512"/>
>   <field name="status" column="status"/>
>   <field name="prevFetchTime" column="prevFetchTime"/>
>   <field name="fetchTime" column="fetchTime"/>
>   <field name="fetchInterval" column="fetchInterval"/>
>   <field name="retriesSinceFetch" column="retriesSinceFetch"/>
>   <field name="reprUrl" column="reprUrl" length="512"/>
>   <!-- <field name="content" column="content" length="65536"/>-->
>   *<field name="content" column="content" length="262144"/>*
>   <field name="contentType" column="typ" length="32"/>
>   <field name="protocolStatus" column="protocolStatus"/>
>   <field name="modifiedTime" column="modifiedTime"/>
>
>   <!-- parse fields -->
>   <field name="title" column="title" length="512"/>
>   <field name="text" column="text" length="32000"/>
>   <field name="parseStatus" column="parseStatus"/>
>   <field name="signature" column="signature"/>
>   <field name="prevSignature" column="prevSignature"/>
>
>   <!-- score fields -->
>   <field name="score" column="score"/>
>   <field name="headers" column="headers"/>
>   <field name="inlinks" column="inlinks"/>
>   <field name="outlinks" column="outlinks"/>
>   <field name="metadata" column="metadata"/>
>   <field name="markers" column="markers"/>
> </class>
>
> But I'm getting no love here. Any other ideas what could be trimming the
> content?
>
> --
> Ward Loving
> Senior Technical Consultant
> Appirio, Inc.
> www.appirio.com
> (706) 225-9475
>

--
*Lewis*

