Hi everyone, I've found that I'm unable to parse very large XML files. This doesn't seem to happen with other file formats. When I run any of the offending files through ParserChecker, I get something along the lines of:
# bin/nutch org.apache.nutch.parse.ParserChecker http://www.aip.org/history/ead/19990074.xml
--------- Url ---------------
http://www.aip.org/history/ead/19990074.xml
--------- ParseData ---------
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

One thing which may or may not be relevant: when I open these XML files in a browser, the http:// at the beginning of the URL tends to disappear. That seems relevant because it might defeat my file.content.limit, http.content.limit, and ftp.content.limit properties. Is there a way around this?

Thanks,
Chip
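For what it's worth, one guess is that the large files are being truncated at the fetcher's default content limit before they ever reach the parser, which would leave the XML malformed. A sketch of the overrides I'd expect to need in conf/nutch-site.xml, assuming -1 still means "no limit" as documented in nutch-default.xml:

```xml
<!-- conf/nutch-site.xml: raise the per-protocol download size limits.
     These property names come from nutch-default.xml; -1 is documented
     to disable truncation. Re-fetch the affected URLs after changing. -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>ftp.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

If the files parse after this change, truncation was the culprit rather than the parser itself.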