The parse that is producing the timeouts happens early in the process.
There are, to my knowledge, no Nutch settings to make it faster or change
its behaviour; it's all about the parser implementation.

Try increasing your parser.timeout setting.
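
For example, an override along these lines could go inside the
<configuration> element of conf/nutch-site.xml. This is only a sketch: the
value of 120 is an arbitrary figure picked for illustration, not a
recommendation. As far as I know the property is expressed in seconds,
defaults to 30, and can be set to -1 to disable the timeout altogether, but
check the description in your own nutch-default.xml before relying on that.

```xml
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>parser.timeout</name>
  <!-- assumed value for illustration; tune it for your largest files -->
  <value>120</value>
</property>
```

You can then re-run ParserChecker against the same URL to see whether the
parse completes within the new limit.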

On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote:
> I've got a few very large (upwards of 3 MB) XML files I'm trying to index,
> and I'm having trouble. Previously I'd had trouble with the fetch; now
> that seems to be okay, but due to the size of the files the parse takes
> much too long.
> 
> Is there a good way to optimize this that I'm missing? Is lengthy parsing
> of XML a known problem? I recognize that part of my problem is that I'm
> doing my testing from my aging desktop PC, and it will run faster when I
> move things to the server, but it's still slow.
> 
> I do get the following weird message in my log when I run ParserChecker or
> the crawler:
> 
> 2011-10-26 09:51:47,729 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> application/xml, but they are not mapped to it  in the parse-plugins.xml
> file
> 2011-10-26 10:06:40,639 WARN  parse.ParseUtil - TIMEOUT parsing
> http://www.aip.org/history/ead/19990074.xml with
> org.apache.nutch.parse.tika.TikaParser@18355aa
> 2011-10-26 10:06:40,639 WARN  parse.ParseUtil - Unable to successfully
> parse content http://www.aip.org/history/ead/19990074.xml of type
> application/xml
> 
> My ParserChecker results look like this:
> 
> # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.aip.org/history/ead/19990074.xml
> ---------
> Url
> ---------------
> http://www.aip.org/history/ead/19990074.xml---------
> ParseData
> ---------
> Version: 5
> Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
> Title:
> Outlinks: 0
> Content Metadata:
> Parse Metadata:
> ---------
> ParseText
> ---------
> 
> And here's everything that might be relevant in my nutch-site.xml; I've
> tried it both with and without the urlmeta plugin, and that doesn't make a
> difference:
>
