No content should be truncated if you set http.content.limit to -1 and leave the rest of the settings at their defaults. It is as simple as that. Have you recompiled Nutch with any changes you made before continuing the crawl?
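For anyone landing on this thread later: the limit lives in nutch-default.xml and is overridden in conf/nutch-site.xml. A minimal sketch (property name and default taken from the stock config shipped with Nutch; -1 disables the cap):

```xml
<!-- conf/nutch-site.xml: overrides nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- default is 65536 bytes; -1 removes the limit entirely -->
    <value>-1</value>
  </property>
</configuration>
```

Note that this property only governs fetches over the http protocol; file.content.limit and ftp.content.limit carry their own caps if you fetch over those protocols.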
On Fri, Feb 8, 2013 at 9:01 PM, Ward Loving <[email protected]> wrote:

> Well,
>
> I spoke too soon. I ran a crawl overnight and I'm seeing all kinds of
> truncation happening again. I can hardly find a content field in my
> database that hasn't been truncated. I'm seeing a ton of these warning
> messages in the log:
>
> 2013-02-08 19:40:36,861 WARN parse.ParserJob -
> http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped.
> Content of size 30220 was truncated to 29919
> 2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing
> http://www.episcopalchurch.org/parish/varina-church-richmond-va
> 2013-02-08 19:40:36,861 WARN parse.ParserJob -
> http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped.
> Content of size 29559 was truncated to 28471
> 2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing
> http://www.episcopalchurch.org/parish/vauters-church-champlain-va
>
> This is sort of bizarre. I spot-checked 5 pages when I first started the
> process yesterday morning, and all the content in the content fields was
> complete. Now I'm running it again and nothing is, but I don't see any
> warning messages that anything is amiss with the data for the first
> couple of pages I fetched. I've tried setting the following property to
> false, but it doesn't seem to help:
>
> <property>
>   <name>parser.skip.truncated</name>
>   <value>false</value>
>   <description>Boolean value for whether we should skip parsing for
>   truncated documents. By default this property is activated due to
>   extremely high levels of CPU which parsing can sometimes take.
>   </description>
> </property>
>
> On Thu, Feb 7, 2013 at 5:24 PM, Ward Loving <[email protected]> wrote:
>
> > Yep, looks like it. The configuration is tricky, no doubt. In my case,
> > however, I think I had actually fixed the config; I just couldn't tell
> > that I had resolved the issue. I was looking at stale data.
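As an aside, the ParserJob warnings quoted above are easy to tally programmatically when a crawl produces thousands of them. The sketch below is illustrative only; the regex is derived from the log lines quoted in this thread (where the mail client has wrapped what is a single line in hadoop.log, so whitespace between the parts is matched loosely):

```python
import re

# Pattern for the "parse.ParserJob ... was truncated" warnings quoted
# in this thread; \s+ tolerates line wrapping introduced by mail clients.
TRUNC_RE = re.compile(
    r"WARN parse\.ParserJob -\s+(?P<url>\S+) skipped\.\s+"
    r"Content of size (?P<size>\d+) was truncated to (?P<kept>\d+)"
)

def truncated_pages(log_text: str):
    """Yield (url, original_size, kept_size) for each truncation warning."""
    for m in TRUNC_RE.finditer(log_text):
        yield m.group("url"), int(m.group("size")), int(m.group("kept"))
```

Feeding it the contents of hadoop.log gives one (url, original_size, truncated_size) tuple per warning, which makes it easy to see whether the truncated sizes cluster just under some configured limit.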
> >
> > On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> So the problem for you is resolved?
> >> The main (typical) problem here is in the underlying gora-sql library
> >> and some rather difficult-to-master gora-sql-mapping.xml constraints.
> >> Hope all is resolved
> >> Lewis
> >>
> >> On Thu, Feb 7, 2013 at 1:57 PM, Ward Loving <[email protected]> wrote:
> >>
> >> > Alright...very good news. I guess something I did fixed the issue.
> >> > Once I dropped my webpage table and restarted the process, I'm now
> >> > getting complete pages. The actual load of the data to that field
> >> > can happen somewhat later than the fetch entry in the logs. That's
> >> > easy to see when inserting data the first time around, but not as
> >> > simple to detect when you've loaded data previously. Thanks for
> >> > your assistance.
> >> >
> >> > On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney <
> >> > [email protected]> wrote:
> >> >
> >> > > It will produce more output in the fetcher part of your
> >> > > hadoop.log, not from the parsechecker tool itself; that is why
> >> > > you are seeing nothing more.
> >> > > Are you still having problems with the truncation aspect?
> >> > > Lewis
> >> > >
> >> > > On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Lewis:
> >> > > >
> >> >
> >> > --
> >> > Ward Loving
> >> > Senior Technical Consultant
> >> > Appirio, Inc.
> >> > www.appirio.com
> >> > (706) 225-9475
> >>
> >> --
> >> *Lewis*

--
*Lewis*
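Expanding on Lewis's point about gora-sql-mapping.xml, since it is the usual culprit when content comes back truncated under the SQL store: each WebPage field is mapped to a database column there, and a column created with too small a length will silently clip anything longer on write, regardless of http.content.limit. The fragment below is a sketch from memory of the Nutch 2.x-era mapping file, not an exact copy; check the element and attribute names against the gora-sql-mapping.xml shipped with your release:

```xml
<!-- conf/gora-sql-mapping.xml (sketch; verify names against your copy) -->
<gora-orm>
  <class name="org.apache.nutch.storage.WebPage"
         keyClass="java.lang.String" table="webpage">
    <primarykey column="id" length="512"/>
    <!-- if this length is smaller than your fetched pages, the stored
         content is clipped no matter what http.content.limit says -->
    <field name="content" column="content" length="65536"/>
  </class>
</gora-orm>
```

Note that enlarging the mapping after the fact does not repair rows already written; as in Ward's case, dropping the webpage table and re-crawling is the reliable way to confirm the fix.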

