Aha! It turns out that removing "protocol-httpclient" from my nutch-site.xml's 
plugin.includes value fixes this. If I'm remembering correctly, I only added 
this in the hope that it would fix something else that it didn't actually fix, 
so hopefully removing it won't break anything.
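
A minimal sketch of the change in nutch-site.xml. The plugin list here is an assumption modeled on the stock Nutch 1.x default; the actual plugin.includes value from this configuration was not posted. The only point it illustrates is that protocol-http is listed and protocol-httpclient is not.

```xml
<!-- nutch-site.xml (fragment; goes inside the <configuration> element) -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed value, modeled on the stock Nutch 1.x default.
       protocol-httpclient is deliberately absent; protocol-http
       handles HTTP fetching instead. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

After a change like this, re-running the ParserChecker command quoted below on the affected URL should show whether ParseText is still cut short.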

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, October 18, 2011 9:58 AM
To: user@nutch.apache.org
Subject: Re: Truncated content despite my content.limit settings.

Strange! I parsed it yesterday as well with parse-tika and the Boilerpipe patch 
enabled and got a lot of output. Can you try a different parser? Your settings 
look fine, but are there any other exotic settings or custom code you use?

On Tuesday 18 October 2011 15:53:26 Chip Calhoun wrote:
> With ParserChecker it's similarly truncated. Could it be the fact that 
> it's a .asp page? The output is as follows:
> 
> # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText 
> http://www.canisius.edu/archives/ruddick.asp
> ---------
> Url
> ---------------
> http://www.canisius.edu/archives/ruddick.asp---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Canisius College - Ruddick Collection
> Outlinks: 20
>   outlink: toUrl: http://www.canisius.edu/v2/SiteStyleClient.css anchor:
>   outlink: toUrl: http://www.canisius.edu/v2/SiteStylePrint.css anchor:
>   outlink: toUrl: http://www.google-analytics.com/urchin.js anchor:
>   outlink: toUrl: http://www.canisius.edu/default.asp anchor: Return to Home
>   outlink: toUrl: http://www.canisius.edu/admissions/rd/?PROP-PROADM anchor: Admissions
>   outlink: toUrl: http://www.canisius.edu/academics/ anchor: Academics
>   outlink: toUrl: http://www.gogriffs.com anchor: Athletics
>   outlink: toUrl: http://www.canisius.edu/studentlife/ anchor: Student Life
>   outlink: toUrl: http://www.canisius.edu/alumnifriends/ anchor: Alumni and Friends
>   outlink: toUrl: http://www.canisius.edu/newsevents/ anchor: News and Events
>   outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/ruddick_centerBanner.jpg anchor:
>   outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/ruddick_HC.gif anchor:
>   outlink: toUrl: http://www.canisius.edu/archives/mission.asp anchor: mission statement
>   outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/mission_blue.gif anchor: mission statement
>   outlink: toUrl: http://www.canisius.edu/archives/directory.asp anchor: archives directory
>   outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/archives_gold.gif anchor: archives directory
>   outlink: toUrl: http://www.canisius.edu/default.asp anchor: Welcome to Canisius
>   outlink: toUrl: http://www.canisius.edu/about/departments.asp anchor: Department Index
>   outlink: toUrl: http://www.canisius.edu/archives/default.asp anchor: Archives & Special Collections
>   outlink: toUrl: http://www.canisius.edu/images/userImages/libweb/Page_12509/Ruddick.jpg anchor:
> Content Metadata: Cache-control=private Date=Tue, 18 Oct 2011 13:44:06 GMT
> Content-Length=10610
> Set-Cookie=ASPSESSIONIDASSCBRRA=LNGICEKCBKDEAOFICKHLDHEL; path=/
> Content-Type=text/html Connection=close X-Powered-By=ASP.NET
> Server=Microsoft-IIS/6.0
> Parse Metadata: CharEncodingForConversion=windows-1252
> OriginalCharEncoding=windows-1252
> ---------
> ParseText
> ---------
> Canisius College - Ruddick Collection Canisius College Archives Return 
> to Home Admissions Academics Athletics Student Life Alumni and 
> Friends News and Events Welcome to Canisius á>á Department Index á>á 
> Archives & Special Collections á>áRuddick Collection Collection of Fr. James 
> J. Ruddick, S.J., 1924-2007 Welcome to the C
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Monday, October 17, 2011 4:26 PM
> To: user@nutch.apache.org
> Subject: Re: Truncated content despite my content.limit settings.
> 
> What does parsechecker tell you?
> 
> nutch org.apache.nutch.parse.ParserChecker -dumpText <URL>
> 
> Keep in mind that your Solr may have a low value for max field length.
> 
> > Hi everyone,
> > 
> > I'm having issues with truncated content on some pages, despite what 
> > I believe to be solid content.limit settings.
> > 
> > One page I have an issue with:
> > http://www.canisius.edu/archives/ruddick.asp
> > 
> > When I run a search in Solr, the content I get is limited to:
> > <str name="content">Canisius College - Ruddick Collection Canisius 
> > College Archives Return to Home Admissions Academics Athletics 
> > Student Life Alumni and Friends News and Events Welcome to Canisius  
> > > Department Index  > Archives & Special Collections  > Ruddick 
> > Collection Collection of Fr. James J. Ruddick, S.J., 1924-2007 
> > Welcome to the Collection of Rev. James J. Ruddick, S.J. chronicling 
> > the</str>
> > 
> > Here's what I have in my nutch-site.xml page, which looks sufficient 
> > to me.
> > 
> > <property>
> > 
> >   <name>db.max.outlinks.per.page</name>
> >   <value>-1</value>
> >   <description>The maximum number of outlinks that we'll process for a
> >   page. If this value is nonnegative (>=0), at most
> >   db.max.outlinks.per.page outlinks will be processed for a page;
> >   otherwise, all outlinks will be processed.
> >   </description>
> > 
> > </property>
> > <property>
> > 
> >   <name>file.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content using the file://
> >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >   than it will be truncated; otherwise, no truncation at all. Do not
> >   confuse this setting with the http.content.limit setting.
> >   </description>
> > 
> > </property>
> > <property>
> > 
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content, in bytes.
> >   If this value is nonnegative (>=0), content longer than it will be
> >   truncated; otherwise, no truncation at all.
> >   </description>
> > 
> > </property>
> > <property>
> > 
> >   <name>ftp.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content, in bytes.
> >   If this value is nonnegative (>=0), content longer than it will be
> >   truncated; otherwise, no truncation at all.
> >   Caution: classical ftp RFCs never defines partial transfer and, in
> >   fact, some ftp servers out there do not handle client side forced
> >   close-down very well. Our implementation tries its best to handle
> >   such situations smoothly.
> >   </description>
> > 
> > </property>
> > 
> > Can anyone see what I'm missing? Thanks.
> > 
> > Chip

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
