Aha! It turns out that removing "protocol-httpclient" from my nutch-site.xml's plugin.includes value fixes this. If I'm remembering correctly, I only added this in the hope that it would fix something else that it didn't actually fix, so hopefully removing it won't break anything.
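A minimal sketch of that change in nutch-site.xml, for anyone hitting the same problem. The plugin list below is an assumption modelled on a typical Nutch 1.x setup, not the actual value from this thread; the only point is that protocol-http stays in and protocol-httpclient is left out.

```xml
<!-- Assumed, illustrative plugin list: keep your own value and only drop
     the "protocol-httpclient" entry from it. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

After editing, re-running ParserChecker with -dumpText against the affected URL should show whether the ParseText is still cut short.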

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, October 18, 2011 9:58 AM
To: user@nutch.apache.org
Subject: Re: Truncated content despite my content.limit settings.

Strange! I parsed it yesterday as well with parse-tika and the Boilerpipe patch enabled and got a lot of output. Can you try a different parser? Your settings look fine, but are there any other exotic settings you use, or custom code?

On Tuesday 18 October 2011 15:53:26 Chip Calhoun wrote:
> With ParserChecker it's similarly truncated. Could it be the fact that
> it's a .asp page? The output is as follows:
>
> # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.canisius.edu/archives/ruddick.asp
> ---------
> Url
> ---------------
> http://www.canisius.edu/archives/ruddick.asp---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Canisius College - Ruddick Collection
> Outlinks: 20
>   outlink: toUrl: http://www.canisius.edu/v2/SiteStyleClient.css anchor:
>   outlink: toUrl: http://www.canisius.edu/v2/SiteStylePrint.css anchor:
>   outlink: toUrl: http://www.google-analytics.com/urchin.js anchor:
>   outlink: toUrl: http://www.canisius.edu/default.asp anchor: Return to Home
>   outlink: toUrl: http://www.canisius.edu/admissions/rd/?PROP-PROADM anchor: Admissions
>   outlink: toUrl: http://www.canisius.edu/academics/ anchor: Academics
>   outlink: toUrl: http://www.gogriffs.com anchor: Athletics
>   outlink: toUrl: http://www.canisius.edu/studentlife/ anchor: Student Life
>   outlink: toUrl: http://www.canisius.edu/alumnifriends/ anchor: Alumni and Friends
>   outlink: toUrl: http://www.canisius.edu/newsevents/ anchor: News and Events
>   outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/ruddick_centerBanner.jpg anchor:
>   outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/ruddick_HC.gif anchor:
>   outlink: toUrl: http://www.canisius.edu/archives/mission.asp anchor: mission statement
>   outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/mission_blue.gif anchor: mission statement
>   outlink: toUrl: http://www.canisius.edu/archives/directory.asp anchor: archives directory
>   outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/archives_gold.gif anchor: archives directory
>   outlink: toUrl: http://www.canisius.edu/default.asp anchor: Welcome to Canisius
>   outlink: toUrl: http://www.canisius.edu/about/departments.asp anchor: Department Index
>   outlink: toUrl: http://www.canisius.edu/archives/default.asp anchor: Archives & Special Collections
>   outlink: toUrl: http://www.canisius.edu/images/userImages/libweb/Page_12509/Ruddick.jpg anchor:
> Content Metadata: Cache-control=private Date=Tue, 18 Oct 2011 13:44:06 GMT Content-Length=10610 Set-Cookie=ASPSESSIONIDASSCBRRA=LNGICEKCBKDEAOFICKHLDHEL; path=/ Content-Type=text/html Connection=close X-Powered-By=ASP.NET Server=Microsoft-IIS/6.0
> Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
> ---------
> ParseText
> ---------
> Canisius College - Ruddick Collection Canisius College Archives Return to Home Admissions Academics Athletics Student Life Alumni and Friends News and Events Welcome to Canisius á>á Department Index á>á Archives & Special Collections á>áRuddick Collection Collection of Fr. James J. Ruddick, S.J., 1924-2007 Welcome to the C
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Monday, October 17, 2011 4:26 PM
> To: user@nutch.apache.org
> Subject: Re: Truncated content despite my content.limit settings.
>
> What does parsechecker tell you?
>
> nutch org.apache.nutch.parse.ParserChecker -dumpText <URL>
>
> Keep in mind that your Solr may have a low value for max field length.
>
> > Hi everyone,
> >
> > I'm having issues with truncated content on some pages, despite what
> > I believe to be solid content.limit settings.
> >
> > One page I have an issue with:
> > http://www.canisius.edu/archives/ruddick.asp
> >
> > When I run a search in Solr, the content I get is limited to:
> >
> > <str name="content">Canisius College - Ruddick Collection Canisius College Archives Return to Home Admissions Academics Athletics Student Life Alumni and Friends News and Events Welcome to Canisius > Department Index > Archives & Special Collections > Ruddick Collection Collection of Fr. James J. Ruddick, S.J., 1924-2007 Welcome to the Collection of Rev. James J. Ruddick, S.J. chronicling the</str>
> >
> > Here's what I have in my nutch-site.xml page, which looks sufficient
> > to me.
> >
> > <property>
> >   <name>db.max.outlinks.per.page</name>
> >   <value>-1</value>
> >   <description>The maximum number of outlinks that we'll process for a
> >   page. If this value is nonnegative (>=0), at most
> >   db.max.outlinks.per.page outlinks will be processed for a page;
> >   otherwise, all outlinks will be processed.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>file.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content using the file://
> >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >   than it will be truncated; otherwise, no truncation at all. Do not
> >   confuse this setting with the http.content.limit setting.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content, in bytes.
> >   If this value is nonnegative (>=0), content longer than it will be
> >   truncated; otherwise, no truncation at all.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>ftp.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content, in bytes.
> >   If this value is nonnegative (>=0), content longer than it will be
> >   truncated; otherwise, no truncation at all.
> >   Caution: classical ftp RFCs never defines partial transfer and, in
> >   fact, some ftp servers out there do not handle client side forced
> >   close-down very well. Our implementation tries its best to handle
> >   such situations smoothly.
> >   </description>
> > </property>
> >
> > Can anyone see what I'm missing? Thanks.
> >
> > Chip

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350