I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and I'm having trouble. Previously I'd had trouble with the fetch; now that seems to be okay, but due to the size of the files the parse takes much too long.
Is there a good way to optimize this that I'm missing? Is lengthy parsing of XML a known problem? I recognize that part of my problem is that I'm doing my testing from my aging desktop PC, and it will run faster when I move things to the server, but it's still slow.

I do get the following weird messages in my log when I run ParserChecker or the crawler:

2011-10-26 09:51:47,729 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
2011-10-26 10:06:40,639 WARN parse.ParseUtil - TIMEOUT parsing http://www.aip.org/history/ead/19990074.xml with org.apache.nutch.parse.tika.TikaParser@18355aa
2011-10-26 10:06:40,639 WARN parse.ParseUtil - Unable to successfully parse content http://www.aip.org/history/ead/19990074.xml of type application/xml

My ParserChecker results look like this:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.aip.org/history/ead/19990074.xml
--------- Url ---------------
http://www.aip.org/history/ead/19990074.xml
--------- ParseData ---------
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:
--------- ParseText ---------

And here's everything that might be relevant in my nutch-site.xml; I've tried it both with and without the urlmeta plugin, and that makes no difference:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this
  value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.</description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this
  value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all. Caution: classical ftp RFCs never
  define partial transfer and, in fact, some ftp servers out there do not
  handle client side forced close-down very well. Our implementation tries
  its best to handle such situations smoothly.</description>
</property>
<property>
  <name>http.timeout</name>
  <value>4294967290</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>ftp.timeout</name>
  <value>4294967290</value>
  <description>Default timeout for ftp client socket, in millisec. Please
  also see ftp.keep.connection below.</description>
</property>
<property>
  <name>ftp.server.timeout</name>
  <value>4294967290</value>
  <description>An estimation of ftp server idle time, in millisec.
  Typically it is 120000 millisec for many ftp servers out there. Better
  be conservative here. Together with ftp.timeout, it is used to decide
  if we need to delete (annihilate) the current ftp.client instance and
  force the start of another ftp.client instance anew. This is necessary
  because a fetcher thread may not be able to obtain the next request from
  the queue in time (due to idleness) before our ftp client times out or
  the remote server disconnects. Used only when ftp.keep.connection is
  true (please see below).</description>
</property>
<property>
  <name>parser.timeout</name>
  <value>900</value>
  <description>Timeout in seconds for the parsing of a document; otherwise
  it is treated as an exception and we move on to the following documents.
  This parameter is applied to any Parser implementation. Set to -1 to
  deactivate, bearing in mind that this could cause the parsing to crash
  because of a very long or corrupted document.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>1</value>
  <description>The number of FetcherThreads the fetcher should use. This
  also determines the maximum number of requests that are made at once
  (each FetcherThread handles one connection).</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded. In any
  case you need at least to include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP, and
  basic indexing and search plugins.</description>
</property>
<property>
  <name>urlmeta.tags</name>
  <value>humanurl</value>
</property>

-----Original Message-----
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: Thursday, October 20, 2011 10:23 AM
To: 'markus.jel...@openindex.io'; user@nutch.apache.org
Subject: RE: Good workaround for timeout?

Good to know! I was definitely exceeding that, so I've changed my properties.
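[Editorial aside on the first INFO message in the log above: Nutch's conf/parse-plugins.xml maps content types to parser plugins, and application/xml evidently has no entry there. A sketch of a mapping that should quiet that message, assuming the stock "parse-tika" alias shipped in the Nutch 1.x default parse-plugins.xml (verify the alias id against your copy):]

```xml
<!-- In conf/parse-plugins.xml, inside the <parse-plugins> root element. -->
<mimeType name="application/xml">
  <!-- "parse-tika" must match an <alias> entry that points at
       org.apache.nutch.parse.tika.TikaParser, as in the shipped file. -->
  <plugin id="parse-tika" />
</mimeType>
```

[Note this only addresses the mapping warning; the TIMEOUT warning itself means the parse exceeded parser.timeout, which is a separate issue.]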
-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, October 20, 2011 10:00 AM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Good workaround for timeout?

On Thursday 20 October 2011 15:56:01 Chip Calhoun wrote:
> I started out with a pretty high number in http.timeout, and I've
> increased it to the fairly ridiculous 99999999999. Is there an upper
> limit at which it would stop working properly?

It's interpreted as an Integer, so don't exceed Integer.MAX_VALUE. Don't know how hadoop will handle it for sure.

> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Wednesday, October 19, 2011 4:57 PM
> To: user@nutch.apache.org
> Cc: Chip Calhoun
> Subject: Re: Good workaround for timeout?
>
> > I'm using protocol-http, but I removed protocol-httpclient after you
> > pointed out in another thread that it's broken. Unfortunately I'm not
> > sure which properties are used by what, and I'm not sure how to find
> > out. I added some more stuff to nutch-site.xml (I'll paste it at the
> > end), and it seems to be working so far; but since this has been an
> > intermittent problem, I can't be sure whether I've really fixed it or
> > whether I'm getting lucky.
>
> http.timeout is used in lib-http so it should work unless there's a
> bug around. Does the problem persist for that one URL if you increase
> this value to a more reasonable number, say 300?
>
> > <property>
> >   <name>http.timeout</name>
> >   <value>99999999999</value>
> >   <description>The default network timeout, in milliseconds.</description>
> > </property>
> > <property>
> >   <name>ftp.timeout</name>
> >   <value>9999999999</value>
> >   <description>Default timeout for ftp client socket, in millisec.
> >   Please also see ftp.keep.connection below.</description>
> > </property>
> > <property>
> >   <name>ftp.server.timeout</name>
> >   <value>99999999999999999</value>
> >   <description>An estimation of ftp server idle time, in millisec.
> >   Typically it is 120000 millisec for many ftp servers out there.
> >   Better be conservative here. Together with ftp.timeout, it is used
> >   to decide if we need to delete (annihilate) the current ftp.client
> >   instance and force the start of another ftp.client instance anew.
> >   This is necessary because a fetcher thread may not be able to obtain
> >   the next request from the queue in time (due to idleness) before our
> >   ftp client times out or the remote server disconnects. Used only
> >   when ftp.keep.connection is true (please see below).</description>
> > </property>
> > <property>
> >   <name>parser.timeout</name>
> >   <value>300</value>
> >   <description>Timeout in seconds for the parsing of a document;
> >   otherwise it is treated as an exception and we move on to the
> >   following documents. This parameter is applied to any Parser
> >   implementation. Set to -1 to deactivate, bearing in mind that this
> >   could cause the parsing to crash because of a very long or corrupted
> >   document.</description>
> > </property>
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Wednesday, October 19, 2011 11:28 AM
> > To: user@nutch.apache.org
> > Subject: Re: Good workaround for timeout?
> >
> > It is indeed. Tricky.
> >
> > Are you going through some proxy? Are you using protocol-http or
> > httpclient? Are you sure the http.time.out value is actually used in
> > lib-http?
> >
> > > If I'm reading the log correctly, it's the fetch:
> > >
> > > 2011-10-19 11:18:11,405 INFO fetcher.Fetcher - fetch of
> > > http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
> > > failed with: java.net.SocketTimeoutException: Read timed out
> > >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Wednesday, October 19, 2011 11:08 AM
> > > To: user@nutch.apache.org
> > > Subject: Re: Good workaround for timeout?
> > >
> > > What is timing out, the fetch or the parse?
> > >
> > > > I'm getting a fairly persistent timeout on a particular page.
> > > > Other, smaller pages in this folder do fine, but this one times
> > > > out most of the time. When it fails, my ParserChecker results
> > > > look like:
> > > >
> > > > # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText
> > > > http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
> > > > Exception in thread "main" java.lang.NullPointerException
> > > >     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
> > > >
> > > > I've stuck with the default value of "10" in my nutch-default.xml's
> > > > fetcher.threads.fetch value, and I've added the following to
> > > > nutch-site.xml:
> > > >
> > > > <property>
> > > >   <name>db.max.outlinks.per.page</name>
> > > >   <value>-1</value>
> > > >   <description>The maximum number of outlinks that we'll process
> > > >   for a page. If this value is nonnegative (>=0), at most
> > > >   db.max.outlinks.per.page outlinks will be processed for a page;
> > > >   otherwise, all outlinks will be processed.</description>
> > > > </property>
> > > > <property>
> > > >   <name>file.content.limit</name>
> > > >   <value>-1</value>
> > > >   <description>The length limit for downloaded content using the
> > > >   file:// protocol, in bytes. If this value is nonnegative (>=0),
> > > >   content longer than it will be truncated; otherwise, no
> > > >   truncation at all. Do not confuse this setting with the
> > > >   http.content.limit setting.</description>
> > > > </property>
> > > > <property>
> > > >   <name>http.content.limit</name>
> > > >   <value>-1</value>
> > > >   <description>The length limit for downloaded content, in bytes.
> > > >   If this value is nonnegative (>=0), content longer than it will
> > > >   be truncated; otherwise, no truncation at all.</description>
> > > > </property>
> > > > <property>
> > > >   <name>ftp.content.limit</name>
> > > >   <value>-1</value>
> > > >   <description>The length limit for downloaded content, in bytes.
> > > >   If this value is nonnegative (>=0), content longer than it will
> > > >   be truncated; otherwise, no truncation at all. Caution:
> > > >   classical ftp RFCs never define partial transfer and, in fact,
> > > >   some ftp servers out there do not handle client side forced
> > > >   close-down very well. Our implementation tries its best to
> > > >   handle such situations smoothly.</description>
> > > > </property>
> > > > <property>
> > > >   <name>http.timeout</name>
> > > >   <value>99999999999</value>
> > > >   <description>The default network timeout, in milliseconds.</description>
> > > > </property>
> > > >
> > > > What else can I do? Thanks.
> > > >
> > > > Chip

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
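[Editorial footnote on Markus's Integer.MAX_VALUE point: values like 99999999999 or 4294967290 do not fit in a Java int, and Integer.parseInt rejects out-of-range input. A config reader that follows the common Hadoop-style getInt pattern of catching NumberFormatException would then silently fall back to the default rather than use the huge value. A minimal sketch of that pattern (not Nutch code; the getInt helper here is hypothetical):]

```java
public class IntSettingSketch {

    // Hypothetical helper mimicking the Hadoop-style getInt pattern:
    // parse the string, and on failure return the default value.
    static int getInt(String valueString, int defaultValue) {
        try {
            return Integer.parseInt(valueString);
        } catch (NumberFormatException e) {
            // Out-of-range or malformed values land here.
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        // Fits in an int (max 2147483647): used as given.
        System.out.println(getInt("300000", 10000));      // 300000

        // 99999999999 exceeds Integer.MAX_VALUE: parse fails, default wins.
        System.out.println(getInt("99999999999", 10000)); // 10000
    }
}
```

[If this is what happens under the hood, an oversized http.timeout would quietly behave like the default, which could explain intermittent timeouts despite a "huge" setting; staying at or below 2147483647 is the safe ceiling.]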