No content should be truncated if you set http.content.limit to -1 and leave the rest of the settings at their defaults. It is as simple as that. Have you recompiled Nutch with any changes you made before continuing the crawl?
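For anyone landing on this thread later: the limit lives in nutch-default.xml and is overridden in conf/nutch-site.xml. A minimal sketch (property name and default taken from the stock config shipped with Nutch; -1 disables the cap):

```xml
<!-- conf/nutch-site.xml: overrides nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- default is 65536 bytes; -1 removes the limit entirely -->
    <value>-1</value>
  </property>
</configuration>
```

Note that this property only governs fetches over the http protocol; file.content.limit and ftp.content.limit carry their own caps if you fetch over those protocols.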
On Fri, Feb 8, 2013 at 9:01 PM, Ward Loving <[email protected]> wrote:

> Well,
>
> I spoke too soon. I ran a crawl overnight and I'm seeing all kinds of
> truncation happening again. I can hardly find a content field in my
> database that hasn't been truncated. I'm seeing a ton of these warning
> messages in the log:
>
> 2013-02-08 19:40:36,861 WARN parse.ParserJob -
> http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped.
> Content of size 30220 was truncated to 29919
> 2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing
> http://www.episcopalchurch.org/parish/varina-church-richmond-va
> 2013-02-08 19:40:36,861 WARN parse.ParserJob -
> http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped.
> Content of size 29559 was truncated to 28471
> 2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing
> http://www.episcopalchurch.org/parish/vauters-church-champlain-va
>
> This is sort of bizarre. I spot-checked 5 pages when I first started the
> process yesterday morning, and all the content in the content fields was
> complete. Now I'm running it again and nothing is, but I don't see any
> warning messages that anything is amiss with the data for the first
> couple of pages I fetched. I've tried setting the following property to
> false, but it doesn't seem to help:
>
> <property>
>   <name>parser.skip.truncated</name>
>   <value>false</value>
>   <description>Boolean value for whether we should skip parsing for
>   truncated documents. By default this property is activated due to
>   extremely high levels of CPU which parsing can sometimes take.
>   </description>
> </property>
>
> On Thu, Feb 7, 2013 at 5:24 PM, Ward Loving <[email protected]> wrote:
>
> > Yep, looks like it. The configuration is tricky, no doubt. In my case,
> > however, I think I had actually fixed the config; I just couldn't tell
> > that I had resolved the issue. I was looking at stale data.
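As an aside, the ParserJob warnings quoted above are easy to tally programmatically when a crawl produces thousands of them. The sketch below is illustrative only; the regex is derived from the log lines quoted in this thread (where the mail client has wrapped what is a single line in hadoop.log, so whitespace between the parts is matched loosely):

```python
import re

# Pattern for the "parse.ParserJob ... was truncated" warnings quoted
# in this thread; \s+ tolerates line wrapping introduced by mail clients.
TRUNC_RE = re.compile(
    r"WARN parse\.ParserJob -\s+(?P<url>\S+) skipped\.\s+"
    r"Content of size (?P<size>\d+) was truncated to (?P<kept>\d+)"
)

def truncated_pages(log_text: str):
    """Yield (url, original_size, kept_size) for each truncation warning."""
    for m in TRUNC_RE.finditer(log_text):
        yield m.group("url"), int(m.group("size")), int(m.group("kept"))
```

Feeding it the contents of hadoop.log gives one (url, original_size, truncated_size) tuple per warning, which makes it easy to see whether the truncated sizes cluster just under some configured limit.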
> >
> > On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> So the problem for you is resolved?
> >> The main (typical) problem here is in the underlying gora-sql library
> >> and some rather difficult-to-master gora-sql-mapping.xml constraints.
> >> Hope all is resolved
> >> Lewis
> >>
> >> On Thu, Feb 7, 2013 at 1:57 PM, Ward Loving <[email protected]> wrote:
> >>
> >> > Alright...very good news. I guess something I did fixed the issue.
> >> > Once I dropped my webpage table and restarted the process, I'm now
> >> > getting complete pages. The actual load of the data to that field
> >> > can happen somewhat later than the fetch entry in the logs. That's
> >> > easy to see when inserting data the first time around, but not as
> >> > simple to detect when you've loaded data previously. Thanks for
> >> > your assistance.
> >> >
> >> > On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney <
> >> > [email protected]> wrote:
> >> >
> >> > > It will produce more output in the fetcher part of your
> >> > > hadoop.log, not from the parsechecker tool itself; that is why
> >> > > you are seeing nothing more.
> >> > > Are you still having problems with the truncation aspect?
> >> > > Lewis
> >> > >
> >> > > On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Lewis:
> >> > > >
> >> >
> >> > --
> >> > Ward Loving
> >> > Senior Technical Consultant
> >> > Appirio, Inc.
> >> > www.appirio.com
> >> > (706) 225-9475
> >>
> >> --
> >> *Lewis*

--
*Lewis*
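Expanding on Lewis's point about gora-sql-mapping.xml, since it is the usual culprit when content comes back truncated under the SQL store: each WebPage field is mapped to a database column there, and a column created with too small a length will silently clip anything longer on write, regardless of http.content.limit. The fragment below is a sketch from memory of the Nutch 2.x-era mapping file, not an exact copy; check the element and attribute names against the gora-sql-mapping.xml shipped with your release:

```xml
<!-- conf/gora-sql-mapping.xml (sketch; verify names against your copy) -->
<gora-orm>
  <class name="org.apache.nutch.storage.WebPage"
         keyClass="java.lang.String" table="webpage">
    <primarykey column="id" length="512"/>
    <!-- if this length is smaller than your fetched pages, the stored
         content is clipped no matter what http.content.limit says -->
    <field name="content" column="content" length="65536"/>
  </class>
</gora-orm>
```

Note that enlarging the mapping after the fact does not repair rows already written; as in Ward's case, dropping the webpage table and re-crawling is the reliable way to confirm the fix.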

