Hi everyone,

I've found that I'm unable to parse very large XML files. This doesn't seem to 
happen with other file formats. When I run any of the offending files through 
ParserChecker, I get something along the lines of:

# bin/nutch org.apache.nutch.parse.ParserChecker http://www.aip.org/history/ead/19990074.xml
---------
Url
---------------
http://www.aip.org/history/ead/19990074.xml
---------
ParseData
---------
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

One thing which may or may not be relevant: when I open these XML files in a
browser, the http:// at the beginning of the URL tends to disappear. That seems
relevant because it might defeat my file.content.limit, http.content.limit, and
ftp.content.limit properties. Is there a way around this?
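
For reference, this is how I understand those limits would be overridden in
conf/nutch-site.xml (just a sketch; as I read the docs, the default limit is
65536 bytes and a value of -1 disables the limit, but please correct me if
that's wrong):

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml overrides; the -1 "no limit" value is my
     reading of the property descriptions in nutch-default.xml. -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>Max bytes to download per page over HTTP; -1 for no limit.</description>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>ftp.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

If the XML files are only failing because they exceed the content limit and get
truncated mid-document, I'd expect raising these to fix it, but I may be
misreading the cause.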

Thanks,
Chip
