Hrm. No, it turns out I was wrong; I'd misread an error message. I've got the following in my nutch-site.xml:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down
  very well. Our implementation tries its best to handle such situations
  smoothly.
  </description>
</property>

-----Original Message-----
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: Wednesday, October 05, 2011 9:34 AM
To: 'user@nutch.apache.org'; 'markus.jel...@openindex.io'
Subject: RE: Unable to parse large XML files.

Huh. It turns out my http.content.limit was fine, but I also needed a file.content.limit statement in nutch-site.xml to make this work. Thanks!

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, October 04, 2011 7:41 PM
To: user@nutch.apache.org
Subject: Re: Unable to parse large XML files.

> Hi everyone,
>
> I've found that I'm unable to parse very large XML files. This doesn't
> seem to happen with other file formats. When I run any of the
> offending files through ParserChecker, I get something along the lines of:
>
> # bin/nutch org.apache.nutch.parse.ParserChecker
> http://www.aip.org/history/ead/19990074.xml
> --------- Url
> ---------------
> http://www.aip.org/history/ead/19990074.xml---------
> ParseData
> ---------
> Version: 5
> Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable
> to successfully parse content
> Title:
> Outlinks: 0
> Content Metadata:
> Parse Metadata:
>
> One thing which may or may not be relevant is that when I look XML
> files up in a browser the http:// at the beginning tends to disappear.

You're using some fancy new browser? Some seem to do that. Check your http.content.limit.

> That seems
> relevant because it seems like it might defeat my file.content.limit,
> http.content.limit, and ftp.content.limit
> properties. Is there a way around this?
>
> Thanks,
> Chip
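
Since the fix in this thread came down to which *.content.limit properties were actually present in nutch-site.xml, a small script along these lines may help double-check a config file. It is only a sketch: the names read_content_limits, describe and SAMPLE are made up for illustration and are not part of Nutch, and it assumes the file uses the usual <configuration>/<property> layout.

```python
import xml.etree.ElementTree as ET

# The three properties discussed in this thread.
LIMIT_PROPERTIES = ("file.content.limit", "http.content.limit", "ftp.content.limit")


def read_content_limits(xml_text):
    """Return {property name: int value, or None if the file does not set it}."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in LIMIT_PROPERTIES and value:
            found[name] = int(value)
    return {name: found.get(name) for name in LIMIT_PROPERTIES}


def describe(limit):
    """Explain a limit using the semantics given in the property descriptions."""
    if limit is None:
        return "not set in this file (the nutch-default.xml value applies)"
    if limit < 0:
        return "no truncation"
    return f"truncates content longer than {limit} bytes"


# Hypothetical sample: file and http limits are set, ftp is left out.
SAMPLE = """<configuration>
  <property><name>file.content.limit</name><value>-1</value></property>
  <property><name>http.content.limit</name><value>65536</value></property>
</configuration>"""

if __name__ == "__main__":
    for name, limit in read_content_limits(SAMPLE).items():
        print(f"{name}: {describe(limit)}")
```

To check a real file, pass its contents instead of SAMPLE, e.g. read_content_limits(open("conf/nutch-site.xml").read()); the path is an assumption about a typical install.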