> Hi everyone,
>
> I've found that I'm unable to parse very large XML files. This doesn't seem
> to happen with other file formats. When I run any of the offending files
> through ParserChecker, I get something along the lines of:
>
> # bin/nutch org.apache.nutch.parse.ParserChecker
> http://www.aip.org/history/ead/19990074.xml
> ---------
> Url
> ---------------
> http://www.aip.org/history/ead/19990074.xml
> ---------
> ParseData
> ---------
> Version: 5
> Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> successfully parse content
> Title:
> Outlinks: 0
> Content Metadata:
> Parse Metadata:
>
> One thing which may or may not be relevant is that when I look XML files up
> in a browser the http:// at the beginning tends to disappear.

Are you using some fancy new browser? Some of them seem to do that. Check your
http.content.limit.

> That seems
> relevant because it seems like it might defeat my file.content.limit,
> http.content.limit, and ftp.content.limit
> properties. Is there a way around this?
>
> Thanks,
> Chip
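
For reference, here is a minimal sketch of what that override could look like in
conf/nutch-site.xml. It assumes the file is being truncated at the configured
limit before it reaches the parser, which I have not verified for this URL. The
value -1 (no truncation) is only an illustration; a byte count larger than your
biggest file should work as well.

```xml
<!-- conf/nutch-site.xml: overrides the value from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- Assumed value: a negative number disables truncation of HTTP content -->
  <value>-1</value>
</property>
```

The same pattern should apply to file.content.limit and ftp.content.limit if you
fetch over those protocols. Rerun ParserChecker on the same URL afterwards to
see whether the parse status changes.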

