RE: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

Chip Calhoun Wed, 26 Oct 2011 09:50:00 -0700

Increasing parser.timeout to 3600 got me what I needed. I only have a few files 
this huge, so I'll live with that.
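
For reference, a minimal sketch of that override in nutch-site.xml. The 
property name and the 3600 value are the ones mentioned above; I'm assuming 
the value is read as seconds, as the description in nutch-default.xml says, 
and the description text below is my own wording:

```xml
<!-- Sketch only: goes inside the existing <configuration> element of nutch-site.xml -->
<property>
  <name>parser.timeout</name>
  <value>3600</value>
  <description>Allow up to an hour per document for very large XML files
  (assumed unit: seconds).</description>
</property>
```

Note that the timeout applies to every document, so a value this large also 
lets a genuinely stuck parse run for that long before it is abandoned.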


-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Wednesday, October 26, 2011 10:55 AM
To: [email protected]
Subject: Re: Extremely long parsing of large XML files (Was RE: Good workaround 
for timeout?)

The actual parse, which is producing the timeouts, happens early in the process. 
There are, to my knowledge, no Nutch settings to make this faster or change its 
behaviour; it's all about the parser implementation.

Try increasing your parser.timeout setting.

On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote:
> I've got a few very large (upwards of 3 MB) XML files I'm trying to 
> index, and I'm having trouble. Previously I'd had trouble with the 
> fetch; now that seems to be okay, but due to the size of the files the 
> parse takes much too long.
> 
> Is there a good way to optimize this that I'm missing? Is lengthy 
> parsing of XML a known problem? I recognize that part of my problem is 
> that I'm doing my testing from my aging desktop PC, and it will run 
> faster when I move things to the server, but it's still slow.
> 
> I do get the following weird message in my log when I run 
> ParserChecker or the crawler:
> 
> 2011-10-26 09:51:47,729 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it  in the parse-plugins.xml file
> 2011-10-26 10:06:40,639 WARN  parse.ParseUtil - TIMEOUT parsing http://www.aip.org/history/ead/19990074.xml with org.apache.nutch.parse.tika.TikaParser@18355aa
> 2011-10-26 10:06:40,639 WARN  parse.ParseUtil - Unable to successfully parse content http://www.aip.org/history/ead/19990074.xml of type application/xml
> 
> My ParserChecker results look like this:
> 
> # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.aip.org/history/ead/19990074.xml
> ---------
> Url
> ---------------
> http://www.aip.org/history/ead/19990074.xml
> ---------
> ParseData
> ---------
> Version: 5
> Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
> Title:
> Outlinks: 0
> Content Metadata:
> Parse Metadata:
> ---------
> ParseText
> ---------
> 
> And here's everything that might be relevant in my nutch-site.xml; 
> I've tried it both with and without the urlmeta plugin, and that 
> doesn't make a difference:
>
