Hi Lewis:

Well, I've done some additional testing and the truncation issue seems to
be isolated to the particular web server/site that I'm trying to process.
When I run the process against other sites, I'm not seeing the same issue.
I guess for processing that site I'll have to go with Plan B.

Thanks for your help.

Ward


On Sun, Feb 10, 2013 at 8:19 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> No content should be truncated if you set http.content.limit to -1 and
> leave the default settings on. It is as simple as that.
> Have you recompiled Nutch with some changes you made before continuing
> crawling?
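>
> For reference, a minimal sketch of that override, assuming it is placed
> inside the <configuration> element of conf/nutch-site.xml and that the rest
> of the file is left at its defaults (the description text is paraphrased
> here, not copied from nutch-default.xml):
>
> ```xml
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>Length limit, in bytes, for content downloaded over HTTP.
>   A negative value disables truncation.</description>
> </property>
> ```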
>
> On Fri, Feb 8, 2013 at 9:01 PM, Ward Loving <[email protected]> wrote:
>
> > Well,
> >
> > I spoke too soon.  I ran a crawl overnight and I'm seeing all kinds of
> > truncation happening again.  I can hardly find a content field in my
> > database that hasn't been truncated.  I'm seeing a ton of these warning
> > messages in the log:
> >
> > 2013-02-08 19:40:36,861 WARN  parse.ParserJob -
> > http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped.
> > Content of size 30220 was truncated to 29919
> > 2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
> > http://www.episcopalchurch.org/parish/varina-church-richmond-va
> > 2013-02-08 19:40:36,861 WARN  parse.ParserJob -
> > http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped.
> > Content of size 29559 was truncated to 28471
> > 2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
> > http://www.episcopalchurch.org/parish/vauters-church-champlain-va
> >
> > This is sort of bizarre.  I spot checked 5 pages when I first started
> > the process yesterday morning and all the content in the content fields
> > was complete.  Now I'm running it again and nothing is complete, yet I
> > don't see any warning messages that anything is amiss with the data for
> > the first couple of pages I fetched.  I've tried updating the following
> > setting to false but it doesn't seem to help:
> >
> > <property>
> >   <name>parser.skip.truncated</name>
> >   <value>false</value>
> >   <description>Boolean value for whether we should skip parsing for
> >   truncated documents. By default this property is activated due to
> >   extremely high levels of CPU which parsing can sometimes take.
> >   </description>
> > </property>
> >
> > On Thu, Feb 7, 2013 at 5:24 PM, Ward Loving <[email protected]> wrote:
> >
> > > Yep, looks like it.  The configuration is tricky, no doubt.  In my
> > > case, however, I think I had actually fixed the config; I just couldn't
> > > tell that I had resolved the issue.  I was looking at stale data.
> > >
> > >
> > > On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney <
> > > [email protected]> wrote:
> > >
> > >> So the problem for you is resolved?
> > >> The main (typical) problem here is in the underlying gora-sql library
> > >> and some rather difficult to master gora-sql-mapping.xml constraints.
> > >> Hope all is resolved.
> > >> Lewis
> > >>
> > >> On Thu, Feb 7, 2013 at 1:57 PM, Ward Loving <[email protected]> wrote:
> > >>
> > >> > Alright...very good news.  I guess something I did fixed the issue.
> > >> > Once I dropped my webpage table and restarted the process, I'm now
> > >> > getting complete pages.  The actual load of the data to that field
> > >> > can happen somewhat later than the fetch entry in the logs.  Easy to
> > >> > see when inserting data the first time around.  Not as simple to
> > >> > detect when you've loaded data previously. Thanks for your assistance.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney <
> > >> > [email protected]> wrote:
> > >> >
> > >> > > It will produce more output in the fetcher part of your
> > >> > > hadoop.log, not in the parsechecker tool itself; that is why you
> > >> > > are seeing nothing more.
> > >> > > Are you still having problems with the truncation aspect?
> > >> > > Lewis
> > >> > >
> > >> > > On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving <[email protected]> wrote:
> > >> > >
> > >> > > > Lewis:
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Ward Loving
> > >> > Senior Technical Consultant
> > >> > Appirio, Inc.
> > >> > www.appirio.com
> > >> > (706) 225-9475
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> *Lewis*
> > >>
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
>
> --
> *Lewis*
>



-- 
Ward Loving
Senior Technical Consultant
Appirio, Inc.
www.appirio.com
(706) 225-9475
