Well, I spoke too soon. I ran a crawl overnight and I'm seeing all kinds of truncation happening again. I can hardly find a content field in my database that hasn't been truncated. I'm seeing a ton of these warning messages in the log:

2013-02-08 19:40:36,861 WARN parse.ParserJob - http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped. Content of size 30220 was truncated to 29919
2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing http://www.episcopalchurch.org/parish/varina-church-richmond-va
2013-02-08 19:40:36,861 WARN parse.ParserJob - http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped. Content of size 29559 was truncated to 28471
2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing http://www.episcopalchurch.org/parish/vauters-church-champlain-va

This is sort of bizarre. I spot-checked 5 pages when I first started the process yesterday morning, and all the content in the content fields was complete. Now I'm running it again and none of it is complete, yet I don't see any warning messages saying that anything is amiss with the data for the first couple of pages I fetched. I've tried setting the following property to false, but it doesn't seem to help:

<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for truncated documents. By default this property is activated due to extremely high levels of CPU which parsing can sometimes take.
  </description>
</property>

On Thu, Feb 7, 2013 at 5:24 PM, Ward Loving <[email protected]> wrote:

> Yep, looks like it. The configuration is tricky no doubt. In my case,
> however, I think I had actually fixed the config, I just couldn't tell that
> I had resolved the issue. I was looking at stale data.
>
> On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney <[email protected]> wrote:
>
>> So the problem for you is resolved?
>> The main (typical) problem here is in the underlying gora-sql library and
>> some rather difficult to master gora-sql-mapping.xml constraints.
>> Hope all is resolved
>> Lewis
>>
>> On Thu, Feb 7, 2013 at 1:57 PM, Ward Loving <[email protected]> wrote:
>>
>> > Alright...very good news. I guess something I did fixed the issue. Once I
>> > dropped my webpage table and restarted the process, I'm now getting
>> > complete pages. The actual load of the data to that field can happen
>> > somewhat later than the fetch entry in the logs. Easy to see when
>> > inserting data the first time around. Not as simple to detect when you've
>> > loaded data previously. Thanks for your assistance.
>> >
>> > On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney <[email protected]> wrote:
>> >
>> > > It will produce more output on the fetcher part of your hadoop.log, not on
>> > > the parsechecker tool itself; that is why you are seeing nothing more.
>> > > Are you still having problems with the truncation aspect?
>> > > Lewis
>> > >
>> > > On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving <[email protected]> wrote:
>> > >
>> > > > Lewis:
>>
>> --
>> *Lewis*

--
Ward Loving
Senior Technical Consultant
Appirio, Inc.
www.appirio.com
(706) 225-9475
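
The truncation warnings quoted in the log excerpt above follow a fixed pattern, so how widespread the problem is can be measured directly from hadoop.log rather than by spot-checking rows. The following is only a minimal sketch: the regular expression is inferred from the four log lines in this message, and the function names `find_truncations` and `summarize` are hypothetical helpers, not part of Nutch.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Pattern inferred from the ParserJob warnings quoted in this message;
# other Nutch versions may word the warning differently (assumption).
TRUNCATED = re.compile(
    r"WARN\s+parse\.ParserJob - (?P<url>\S+) skipped\. "
    r"Content of size (?P<expected>\d+) was truncated to (?P<actual>\d+)"
)


def find_truncations(lines):
    """Yield (url, expected_size, actual_size) for each truncation warning."""
    for line in lines:
        # finditer also copes with several log entries run together on one line.
        for match in TRUNCATED.finditer(line):
            yield match["url"], int(match["expected"]), int(match["actual"])


def summarize(lines):
    """Return (warning count per host, total bytes missing)."""
    per_host = Counter()
    missing = 0
    for url, expected, actual in find_truncations(lines):
        per_host[urlparse(url).netloc] += 1
        missing += expected - actual
    return per_host, missing


# The two warnings from the excerpt, used here as sample input.
sample = [
    "2013-02-08 19:40:36,861 WARN parse.ParserJob - "
    "http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped. "
    "Content of size 30220 was truncated to 29919",
    "2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing "
    "http://www.episcopalchurch.org/parish/varina-church-richmond-va",
    "2013-02-08 19:40:36,861 WARN parse.ParserJob - "
    "http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped. "
    "Content of size 29559 was truncated to 28471",
]

per_host, missing = summarize(sample)
for host, count in per_host.most_common():
    print(f"{host}: {count} truncated page(s)")
print(f"bytes missing in total: {missing}")
```

To run it against a real crawl, pass an open file handle to `summarize` instead of `sample`. Comparing the per-host counts before and after dropping the webpage table would show whether stale rows or fresh fetches account for the warnings.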

