Well,

I spoke too soon. I ran a crawl overnight and I'm seeing all kinds of
truncation happening again. I can hardly find a content field in my
database that hasn't been truncated. I'm seeing a ton of these warning
messages in the log:

2013-02-08 19:40:36,861 WARN  parse.ParserJob -
http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped.
Content of size 30220 was truncated to 29919
2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
http://www.episcopalchurch.org/parish/varina-church-richmond-va
2013-02-08 19:40:36,861 WARN  parse.ParserJob -
http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped.
Content of size 29559 was truncated to 28471
2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
http://www.episcopalchurch.org/parish/vauters-church-champlain-va
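
To gauge how widespread the truncation is, warnings like the ones above can
be tallied with a short script. This is only a sketch: the regular expression
is modeled on the message format shown in this excerpt, and the names
find_truncated and SAMPLE are made up for illustration. A real hadoop.log
would be read from disk instead of from the inline sample.

```python
import re

# Modeled on the ParserJob warning quoted above. \s+ tolerates the line
# wraps that the mail client introduced into the log excerpt.
TRUNCATED = re.compile(
    r"WARN\s+parse\.ParserJob\s+-\s+(?P<url>\S+)\s+skipped\.\s+"
    r"Content of size (?P<expected>\d+) was truncated to (?P<actual>\d+)"
)


def find_truncated(log_text):
    """Return (url, expected_bytes, actual_bytes) for each truncation warning."""
    return [
        (m.group("url"), int(m.group("expected")), int(m.group("actual")))
        for m in TRUNCATED.finditer(log_text)
    ]


# Sample text copied from the log excerpt in this message.
SAMPLE = """\
2013-02-08 19:40:36,861 WARN  parse.ParserJob -
http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped.
Content of size 30220 was truncated to 29919
2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
http://www.episcopalchurch.org/parish/varina-church-richmond-va
2013-02-08 19:40:36,861 WARN  parse.ParserJob -
http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped.
Content of size 29559 was truncated to 28471
"""

if __name__ == "__main__":
    for url, expected, actual in find_truncated(SAMPLE):
        print(f"{url}: missing {expected - actual} bytes")
```

Listing the affected URLs with the number of missing bytes makes it easier
to see whether every page is cut short or only some of them.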

This is sort of bizarre. I spot-checked 5 pages when I first started the
process yesterday morning, and the content fields were all complete. Now
I'm running it again and none of them are, yet I don't see any warning
messages saying that something is amiss with the data for the first couple
of pages I fetched. I've tried setting the following property to false, but
it doesn't seem to help:

<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for
  truncated documents. By default this property is activated due to
  extremely high levels of CPU which parsing can sometimes take.
  </description>
</property>

On Thu, Feb 7, 2013 at 5:24 PM, Ward Loving <[email protected]> wrote:

> Yep, looks like it.  The configuration is tricky no doubt.  In my case,
> however, I think I had actually fixed the config, I just couldn't tell that
> I had resolved the issue.  I was looking at stale data.
>
>
> On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> So the problem for you is resolved?
>> The main (typical) problem here is in the underlying gora-sql library and
>> some rather difficult to master gora-sql-mapping.xml constraints.
>> Hope all is resolved
>> Lewis
>>
>> On Thu, Feb 7, 2013 at 1:57 PM, Ward Loving <[email protected]> wrote:
>>
>> > Alright...very good news.  I guess something I did fixed the issue.
>> > Once I dropped my webpage table and restarted the process, I'm now
>> > getting complete pages.  The actual load of the data to that field can
>> > happen somewhat later than the fetch entry in the logs.  Easy to see
>> > when inserting data the first time around.  Not as simple to detect
>> > when you've loaded data previously. Thanks for your assistance.
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney <
>> > [email protected]> wrote:
>> >
>> > > It will produce more output in the fetcher part of your hadoop.log, not
>> > > in the parsechecker tool itself; that is why you are seeing nothing more.
>> > > Are you still having problems with the truncation aspect?
>> > > Lewis
>> > >
>> > > On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving <[email protected]> wrote:
>> > >
>> > > > Lewis:
>> > > >
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Ward Loving
>> > Senior Technical Consultant
>> > Appirio, Inc.
>> > www.appirio.com
>> > (706) 225-9475
>> >
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
>
> --
> Ward Loving
> Senior Technical Consultant
> Appirio, Inc.
> www.appirio.com
> (706) 225-9475
>



-- 
Ward Loving
Senior Technical Consultant
Appirio, Inc.
www.appirio.com
(706) 225-9475
