Lewis:

I've updated the following parameter to true:

<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>

But this doesn't seem to be generating much extra output. When I run the parsechecker against the site in question I get the following output:

--------- Url ---------------
http://www.mysite.com/webpage
--------- Metadata ---------
--------- ParseText ---------
All the Content of my page

There are just a few question marks in the parsed text that the bash shell seems to be struggling with, but the page content (under ParseText) looks complete.

On Wed, Feb 6, 2013 at 9:53 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Can you use the parsechecker tool with fetcher.verbose overridden as true
> and the same settings on one of the (HTML?) documents giving you bother?
> The gora-sql-0.1.1-incubating module is becoming a real pain to be honest.
>
> On Wed, Feb 6, 2013 at 6:44 PM, Ward Loving <[email protected]> wrote:
>
> > Hello:
> >
> > I've got Nutch up and running except for one big problem. It is truncating
> > the content of my downloaded pages at 27,000 or 28,000 bytes. Basically it
> > just slices the end of my web pages off and of course that completely hoses
> > any downstream parsing of the tags and content that I'd like to do.
> >
> > This error is driving me crazy. I've done the typical things like update
> > the http.content.limit:
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content using the http
> >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >   than it will be truncated; otherwise, no truncation at all. Do not
> >   confuse this setting with the file.content.limit setting.
> >   </description>
> > </property>
> >
> > And I've updated the content setting in my gora-sql-mapping.xml to a huge
> > number:
> >
> > <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
> >        table="webpage">
> >   <primarykey column="id" length="767"/>
> >   <field name="baseUrl" column="baseUrl" length="512"/>
> >   <field name="status" column="status"/>
> >   <field name="prevFetchTime" column="prevFetchTime"/>
> >   <field name="fetchTime" column="fetchTime"/>
> >   <field name="fetchInterval" column="fetchInterval"/>
> >   <field name="retriesSinceFetch" column="retriesSinceFetch"/>
> >   <field name="reprUrl" column="reprUrl" length="512"/>
> >   <!-- <field name="content" column="content" length="65536"/>-->
> >   *<field name="content" column="content" length="262144"/>*
> >   <field name="contentType" column="typ" length="32"/>
> >   <field name="protocolStatus" column="protocolStatus"/>
> >   <field name="modifiedTime" column="modifiedTime"/>
> >
> >   <!-- parse fields -->
> >   <field name="title" column="title" length="512"/>
> >   <field name="text" column="text" length="32000"/>
> >   <field name="parseStatus" column="parseStatus"/>
> >   <field name="signature" column="signature"/>
> >   <field name="prevSignature" column="prevSignature"/>
> >
> >   <!-- score fields -->
> >   <field name="score" column="score"/>
> >   <field name="headers" column="headers"/>
> >   <field name="inlinks" column="inlinks"/>
> >   <field name="outlinks" column="outlinks"/>
> >   <field name="metadata" column="metadata"/>
> >   <field name="markers" column="markers"/>
> > </class>
> >
> > But I'm getting no love here. Any other ideas what could be trimming the
> > content?
> >
> > --
> > Ward Loving
> > Senior Technical Consultant
> > Appirio, Inc.
> > www.appirio.com
> > (706) 225-9475
> >
>
> --
> *Lewis*
>

--
Ward Loving
Senior Technical Consultant
Appirio, Inc.
www.appirio.com
(706) 225-9475
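
For anyone following the thread, the truncation rule stated in the quoted http.content.limit description can be sketched roughly as below. This is only an illustration of the documented semantics (nonnegative limit truncates, negative limit does not), not Nutch's actual code; the function name `apply_content_limit` and the sample page sizes are hypothetical.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Illustrate the documented http.content.limit rule (hypothetical helper).

    A nonnegative limit truncates content longer than the limit;
    a negative limit (such as -1) means no truncation at all.
    """
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


# Made-up 30,000-byte page, just to exercise both branches.
page = b"x" * 30000

print(len(apply_content_limit(page, -1)))     # 30000: negative limit, nothing cut
print(len(apply_content_limit(page, 27000)))  # 27000: nonnegative limit truncates
```

If the setting behaves as its description says, a value of -1 should leave the fetched bytes untouched, which would suggest the trimming reported above happens somewhere other than this limit.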

