Hi Lewis:

Well, I've done some additional testing and the truncation issue seems to be isolated to the particular web server/site that I'm trying to process. When I run the process against other sites, I'm not seeing the same issue. I guess for processing that site I'll have to go with Plan B.
Thanks for your help.

Ward

On Sun, Feb 10, 2013 at 8:19 PM, Lewis John Mcgibbney <[email protected]> wrote:

> No content should be truncated if you set http.content.limit to -1 and
> leave the default settings on. It is as simple as that.
> Have you recompiled Nutch with some changes you made before continuing
> crawling?
>
> On Fri, Feb 8, 2013 at 9:01 PM, Ward Loving <[email protected]> wrote:
>
> > Well,
> >
> > I spoke too soon. I ran a crawl overnight and I'm seeing all kinds of
> > truncation happening again. I can hardly find a content field in my
> > database that hasn't been truncated. I'm seeing a ton of these warning
> > messages in the log:
> >
> > 2013-02-08 19:40:36,861 WARN parse.ParserJob - http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped. Content of size 30220 was truncated to 29919
> > 2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing http://www.episcopalchurch.org/parish/varina-church-richmond-va
> > 2013-02-08 19:40:36,861 WARN parse.ParserJob - http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped. Content of size 29559 was truncated to 28471
> > 2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing http://www.episcopalchurch.org/parish/vauters-church-champlain-va
> >
> > This is sort of bizarre. I spot checked 5 pages when I first started the
> > process yesterday morning and all the content in the content fields was
> > complete. Now I'm running it again and nothing is, but I don't see the
> > warning messages that anything is amiss with the data with the first couple
> > of pages I fetched. I've tried updating the following setting to false but
> > it doesn't seem to help:
> >
> > <property>
> >   <name>parser.skip.truncated</name>
> >   <value>false</value>
> >   <description>Boolean value for whether we should skip parsing for
> >   truncated documents. By default this property is activated due to
> >   extremely high levels of CPU which parsing can sometimes take.
> >   </description>
> > </property>
> >
> > On Thu, Feb 7, 2013 at 5:24 PM, Ward Loving <[email protected]> wrote:
> >
> > > Yep, looks like it. The configuration is tricky no doubt. In my case,
> > > however, I think I had actually fixed the config, I just couldn't tell that
> > > I had resolved the issue. I was looking at stale data.
> > >
> > > On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney <[email protected]> wrote:
> > >
> > >> So the problem for you is resolved?
> > >> The main (typical) problem here is in the underlying gora-sql library and
> > >> some rather difficult to master gora-sql-mapping.xml constraints.
> > >> Hope all is resolved
> > >> Lewis
> > >>
> > >> On Thu, Feb 7, 2013 at 1:57 PM, Ward Loving <[email protected]> wrote:
> > >>
> > >> > Alright...very good news. I guess something I did fixed the issue. Once I
> > >> > dropped my webpage table and restarted the process, I'm now getting
> > >> > complete pages. The actual load of the data to that field can happen
> > >> > somewhat later than the fetch entry in the logs. Easy to see when
> > >> > inserting data the first time around. Not as simple to detect when you've
> > >> > loaded data previously. Thanks for your assistance.
> > >> >
> > >> > On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney <[email protected]> wrote:
> > >> >
> > >> > > It will produce more output on the fetcher part of your hadoop.log, not on
> > >> > > the parsechecker tool itself; that is why you are seeing nothing more.
> > >> > > Are you still having problems with the truncation aspect?
> > >> > > Lewis
> > >> > >
> > >> > > On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving <[email protected]> wrote:
> > >> > >
> > >> > > > Lewis:
> > >> > > >
> > >> >
> > >> > --
> > >> > Ward Loving
> > >> > Senior Technical Consultant
> > >> > Appirio, Inc.
> > >> > www.appirio.com
> > >> > (706) 225-9475
> > >>
> > >> --
> > >> *Lewis*
>
> --
> *Lewis*

--
Ward Loving
Senior Technical Consultant
Appirio, Inc.
www.appirio.com
(706) 225-9475
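
For reference, here is a minimal sketch of how the two settings discussed in this thread might look as overrides in conf/nutch-site.xml. The property names and values are the ones mentioned in the messages above. Placing them in nutch-site.xml (rather than editing nutch-default.xml) and the wording of the comments are assumptions, so check them against the nutch-default.xml shipped with your Nutch version.

```xml
<?xml version="1.0"?>
<configuration>

  <!-- Per Lewis's advice: a value of -1 is meant to disable the
       size limit on content fetched over HTTP. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

  <!-- The setting Ward tried: false asks the parser not to skip
       documents it considers truncated. In this thread it did not
       help for the affected site. -->
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
  </property>

</configuration>
```

As noted earlier in the thread, rows fetched before a configuration change can still hold stale, truncated content, so it may be necessary to drop the webpage table and re-crawl before judging whether the change worked.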

