Lewis:

I've updated the following parameter to true:

<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>
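
One way to double-check that an override like this is really present in the file being edited is to read the value back out of it. This is only a minimal sketch in Python: it assumes the usual Hadoop-style layout of `<property>` entries inside a `<configuration>` root, and `read_properties` and `SAMPLE` are hypothetical names for illustration, not part of Nutch.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return a {name: value} dict built from <property> entries."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Hypothetical file contents mirroring the override above; in practice
# the text would be read from the site configuration file instead.
SAMPLE = """<configuration>
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>
</configuration>"""

if __name__ == "__main__":
    props = read_properties(SAMPLE)
    print("fetcher.verbose =", props.get("fetcher.verbose"))
```

This only confirms what is written in the file; it says nothing about which configuration files the running job actually loads.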

But this doesn't seem to be generating much extra output.  When I run the
parsechecker against the site in question I get the following output:

---------
Url
---------------
http://www.mysite.com/webpage
---------
Metadata
---------
---------
ParseText
---------
All the Content of my page

There are just a few question marks in the parsed text that the bash shell
seems to be struggling with but the page content (under ParseText) looks
complete.
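
A rough way to tell whether a given copy of the page was cut off is to look at its byte length and check for the closing tag near the end. The sketch below is a heuristic only, in Python: it assumes the page is HTML that ends with `</html>`, and that the raw bytes have already been obtained some other way (for example, saved to a file). `looks_truncated` and `report` are hypothetical helper names.

```python
def looks_truncated(raw: bytes) -> bool:
    """Heuristic: no closing </html> tag in the last 1 KB suggests a cut-off page."""
    tail = raw[-1024:].lower()
    return b"</html>" not in tail


def report(raw: bytes) -> str:
    """Summarize the byte length and the result of the truncation heuristic."""
    status = "possibly truncated" if looks_truncated(raw) else "looks complete"
    return f"{len(raw)} bytes, {status}"


if __name__ == "__main__":
    # Hypothetical inputs: one complete page and the same page cut short.
    complete = b"<html><body>All the Content of my page</body></html>"
    cut_off = complete[:20]
    print(report(complete))
    print(report(cut_off))
```

Running the same check on the bytes seen at parse time and on the bytes that end up in storage would show at which stage the content gets shorter.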




On Wed, Feb 6, 2013 at 9:53 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Can you use the parsechecker tool with fetcher.verbose overridden as true
> and the same settings on one of the (HTML?) documents giving you bother?
> The gora-sql-0.1.1-incubating module is becoming a real pain to be honest.
>
> On Wed, Feb 6, 2013 at 6:44 PM, Ward Loving <[email protected]> wrote:
>
> > Hello:
> >
> > I've got Nutch up and running except for one big problem.  It is
> > truncating the content of my downloaded pages at 27,000 or 28,000 bytes.
> > Basically it just slices the end of my web pages off and of course that
> > completely hoses any downstream parsing of the tags and content that I'd
> > like to do.
> >
> > This error is driving me crazy.  I've done the typical things like updating
> > the http.content.limit:
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content using the http
> >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >   than it will be truncated; otherwise, no truncation at all. Do not
> >   confuse this setting with the file.content.limit setting.
> >   </description>
> > </property>
> >
> > And I've updated the content setting in my gora-sql-mapping.xml to a huge
> > number:
> >
> > <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
> >   table="webpage">
> >   <primarykey column="id" length="767"/>
> >     <field name="baseUrl" column="baseUrl" length="512"/>
> >     <field name="status" column="status"/>
> >     <field name="prevFetchTime" column="prevFetchTime"/>
> >     <field name="fetchTime" column="fetchTime"/>
> >     <field name="fetchInterval" column="fetchInterval"/>
> >     <field name="retriesSinceFetch" column="retriesSinceFetch"/>
> >     <field name="reprUrl" column="reprUrl" length="512"/>
> >     <!-- <field name="content" column="content" length="65536"/>-->
> >     *<field name="content" column="content" length="262144"/>*
> >     <field name="contentType" column="typ" length="32"/>
> >     <field name="protocolStatus" column="protocolStatus"/>
> >     <field name="modifiedTime" column="modifiedTime"/>
> >
> >     <!-- parse fields                                       -->
> >     <field name="title" column="title" length="512"/>
> >     <field name="text" column="text" length="32000"/>
> >     <field name="parseStatus" column="parseStatus"/>
> >     <field name="signature" column="signature"/>
> >     <field name="prevSignature" column="prevSignature"/>
> >
> >     <!-- score fields                                       -->
> >     <field name="score" column="score"/>
> >     <field name="headers" column="headers"/>
> >     <field name="inlinks" column="inlinks"/>
> >     <field name="outlinks" column="outlinks"/>
> >     <field name="metadata" column="metadata"/>
> >     <field name="markers" column="markers"/>
> > </class>
> >
> > But I'm getting no love here.   Any other ideas what could be trimming
> > the content?
> >
> > --
> > Ward Loving
> > Senior Technical Consultant
> > Appirio, Inc.
> > www.appirio.com
> > (706) 225-9475
> >
>
>
>
> --
> *Lewis*
>



-- 
Ward Loving
Senior Technical Consultant
Appirio, Inc.
www.appirio.com
(706) 225-9475
