> Hi everyone,
> 
> I've found that I'm unable to parse very large XML files. This doesn't seem
> to happen with other file formats. When I run any of the offending files
> through ParserChecker, I get something along the lines of:
> 
> # bin/nutch org.apache.nutch.parse.ParserChecker http://www.aip.org/history/ead/19990074.xml
> ---------
> Url
> ---------------
> http://www.aip.org/history/ead/19990074.xml
> ---------
> ParseData
> ---------
> Version: 5
> Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> successfully parse content
> Title:
> Outlinks: 0
> Content Metadata:
> Parse Metadata:
> 
> One thing which may or may not be relevant is that when I look up XML files
> in a browser, the http:// at the beginning tends to disappear. 

You're using some fancy new browser? Some of them hide the http:// prefix 
in the address bar; that's cosmetic and shouldn't affect the crawl. The more 
likely culprit is that the file is being truncated before parsing. Check your 
http.content.limit (the default is 65536 bytes) -- a large XML document cut 
off mid-stream won't be well-formed, so the parser fails.
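
If truncation turns out to be the problem, you can raise or disable the limit in your conf/nutch-site.xml (the value -1 disables truncation; property names are the standard Nutch ones, the chosen values here are just an example):

```xml
<!-- conf/nutch-site.xml: override the 65536-byte default.
     -1 means "no limit"; a large positive value caps memory use instead. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
```

Then re-run ParserChecker on one of the failing URLs to confirm the parse succeeds.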

> That seems relevant because it might defeat my file.content.limit,
> http.content.limit, and ftp.content.limit properties. Is there a way
> around this?
> 
> Thanks,
> Chip
