I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and I'm having trouble. Previously I'd had trouble with the fetch; now that seems to be okay, but due to the size of the files the parse takes much too long.
Is there a good way to optimize this that I'm missing? Is lengthy parsing of XML a known problem? I recognize that part of my problem is that I'm doing my testing from my aging desktop PC, and it will run faster when I move things to the server, but it's still slow.

I do get the following weird messages in my log when I run ParserChecker or the crawler:

2011-10-26 09:51:47,729 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
2011-10-26 10:06:40,639 WARN parse.ParseUtil - TIMEOUT parsing http://www.aip.org/history/ead/19990074.xml with org.apache.nutch.parse.tika.TikaParser@18355aa
2011-10-26 10:06:40,639 WARN parse.ParseUtil - Unable to successfully parse content http://www.aip.org/history/ead/19990074.xml of type application/xml

My ParserChecker results look like this:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.aip.org/history/ead/19990074.xml
--------- Url ---------------
http://www.aip.org/history/ead/19990074.xml
--------- ParseData ---------
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:
--------- ParseText ---------

And here's everything that might be relevant in my nutch-site.xml; I've tried it both with and without the urlmeta plugin, and that makes no difference:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this
  value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.</description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this
  value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all. Caution: classical ftp RFCs never
  define partial transfer and, in fact, some ftp servers out there do not
  handle client side forced close-down very well. Our implementation tries
  its best to handle such situations smoothly.</description>
</property>
<property>
  <name>http.timeout</name>
  <value>4294967290</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>ftp.timeout</name>
  <value>4294967290</value>
  <description>Default timeout for ftp client socket, in millisec. Please
  also see ftp.keep.connection below.</description>
</property>
<property>
  <name>ftp.server.timeout</name>
  <value>4294967290</value>
  <description>An estimation of ftp server idle time, in millisec.
  Typically it is 120000 millisec for many ftp servers out there. Better
  be conservative here. Together with ftp.timeout, it is used to decide
  if we need to delete (annihilate) the current ftp.client instance and
  force the start of another ftp.client instance anew. This is necessary
  because a fetcher thread may not be able to obtain the next request from
  the queue in time (due to idleness) before our ftp client times out or
  the remote server disconnects. Used only when ftp.keep.connection is
  true (please see below).</description>
</property>
<property>
  <name>parser.timeout</name>
  <value>900</value>
  <description>Timeout in seconds for the parsing of a document; otherwise
  it is treated as an exception and we move on to the following documents.
  This parameter is applied to any Parser implementation. Set to -1 to
  deactivate, bearing in mind that this could cause the parsing to crash
  because of a very long or corrupted document.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>1</value>
  <description>The number of FetcherThreads the fetcher should use. This
  also determines the maximum number of requests that are made at once
  (each FetcherThread handles one connection).</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded. In any
  case you need at least to include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP, and
  basic indexing and search plugins.</description>
</property>
<property>
  <name>urlmeta.tags</name>
  <value>humanurl</value>
</property>

-----Original Message-----
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: Thursday, October 20, 2011 10:23 AM
To: 'markus.jel...@openindex.io'; user@nutch.apache.org
Subject: RE: Good workaround for timeout?

Good to know! I was definitely exceeding that, so I've changed my properties.
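[Editorial aside on the first INFO message in the log above: Nutch's conf/parse-plugins.xml maps content types to parser plugins, and application/xml evidently has no entry there. A sketch of a mapping that should quiet that message, assuming the stock "parse-tika" alias shipped in the Nutch 1.x default parse-plugins.xml (verify the alias id against your copy):]

```xml
<!-- In conf/parse-plugins.xml, inside the <parse-plugins> root element. -->
<mimeType name="application/xml">
  <!-- "parse-tika" must match an <alias> entry that points at
       org.apache.nutch.parse.tika.TikaParser, as in the shipped file. -->
  <plugin id="parse-tika" />
</mimeType>
```

[Note this only addresses the mapping warning; the TIMEOUT warning itself means the parse exceeded parser.timeout, which is a separate issue.]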
-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, October 20, 2011 10:00 AM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Good workaround for timeout?

On Thursday 20 October 2011 15:56:01 Chip Calhoun wrote:
> I started out with a pretty high number in http.timeout, and I've
> increased it to the fairly ridiculous 99999999999. Is there an upper
> limit at which it would stop working properly?

It's interpreted as an Integer, so don't exceed Integer.MAX_VALUE. Don't know how hadoop will handle it for sure.

> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Wednesday, October 19, 2011 4:57 PM
> To: user@nutch.apache.org
> Cc: Chip Calhoun
> Subject: Re: Good workaround for timeout?
>
> > I'm using protocol-http, but I removed protocol-httpclient after you
> > pointed out in another thread that it's broken. Unfortunately I'm not
> > sure which properties are used by what, and I'm not sure how to find
> > out. I added some more stuff to nutch-site.xml (I'll paste it at the
> > end), and it seems to be working so far; but since this has been an
> > intermittent problem, I can't be sure whether I've really fixed it or
> > whether I'm getting lucky.
>
> http.timeout is used in lib-http so it should work unless there's a
> bug around. Does the problem persist for that one URL if you increase
> this value to a more reasonable number, say 300?
>
> > <property>
> >   <name>http.timeout</name>
> >   <value>99999999999</value>
> >   <description>The default network timeout, in milliseconds.</description>
> > </property>
> > <property>
> >   <name>ftp.timeout</name>
> >   <value>9999999999</value>
> >   <description>Default timeout for ftp client socket, in millisec.
> >   Please also see ftp.keep.connection below.</description>
> > </property>
> > <property>
> >   <name>ftp.server.timeout</name>
> >   <value>99999999999999999</value>
> >   <description>An estimation of ftp server idle time, in millisec.
> >   Typically it is 120000 millisec for many ftp servers out there.
> >   Better be conservative here. Together with ftp.timeout, it is used
> >   to decide if we need to delete (annihilate) the current ftp.client
> >   instance and force the start of another ftp.client instance anew.
> >   This is necessary because a fetcher thread may not be able to obtain
> >   the next request from the queue in time (due to idleness) before our
> >   ftp client times out or the remote server disconnects. Used only
> >   when ftp.keep.connection is true (please see below).</description>
> > </property>
> > <property>
> >   <name>parser.timeout</name>
> >   <value>300</value>
> >   <description>Timeout in seconds for the parsing of a document;
> >   otherwise it is treated as an exception and we move on to the
> >   following documents. This parameter is applied to any Parser
> >   implementation. Set to -1 to deactivate, bearing in mind that this
> >   could cause the parsing to crash because of a very long or corrupted
> >   document.</description>
> > </property>
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Wednesday, October 19, 2011 11:28 AM
> > To: user@nutch.apache.org
> > Subject: Re: Good workaround for timeout?
> >
> > It is indeed. Tricky.
> >
> > Are you going through some proxy? Are you using protocol-http or
> > httpclient? Are you sure the http.time.out value is actually used in
> > lib-http?
> >
> > > If I'm reading the log correctly, it's the fetch:
> > >
> > > 2011-10-19 11:18:11,405 INFO fetcher.Fetcher - fetch of
> > > http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
> > > failed with: java.net.SocketTimeoutException: Read timed out
> > >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Wednesday, October 19, 2011 11:08 AM
> > > To: user@nutch.apache.org
> > > Subject: Re: Good workaround for timeout?
> > >
> > > What is timing out, the fetch or the parse?
> > >
> > > > I'm getting a fairly persistent timeout on a particular page.
> > > > Other, smaller pages in this folder do fine, but this one times
> > > > out most of the time. When it fails, my ParserChecker results
> > > > look like:
> > > >
> > > > # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText
> > > > http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
> > > > Exception in thread "main" java.lang.NullPointerException
> > > >     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
> > > >
> > > > I've stuck with the default value of "10" in my nutch-default.xml's
> > > > fetcher.threads.fetch value, and I've added the following to
> > > > nutch-site.xml:
> > > >
> > > > <property>
> > > >   <name>db.max.outlinks.per.page</name>
> > > >   <value>-1</value>
> > > >   <description>The maximum number of outlinks that we'll process
> > > >   for a page. If this value is nonnegative (>=0), at most
> > > >   db.max.outlinks.per.page outlinks will be processed for a page;
> > > >   otherwise, all outlinks will be processed.</description>
> > > > </property>
> > > > <property>
> > > >   <name>file.content.limit</name>
> > > >   <value>-1</value>
> > > >   <description>The length limit for downloaded content using the
> > > >   file:// protocol, in bytes. If this value is nonnegative (>=0),
> > > >   content longer than it will be truncated; otherwise, no
> > > >   truncation at all. Do not confuse this setting with the
> > > >   http.content.limit setting.</description>
> > > > </property>
> > > > <property>
> > > >   <name>http.content.limit</name>
> > > >   <value>-1</value>
> > > >   <description>The length limit for downloaded content, in bytes.
> > > >   If this value is nonnegative (>=0), content longer than it will
> > > >   be truncated; otherwise, no truncation at all.</description>
> > > > </property>
> > > > <property>
> > > >   <name>ftp.content.limit</name>
> > > >   <value>-1</value>
> > > >   <description>The length limit for downloaded content, in bytes.
> > > >   If this value is nonnegative (>=0), content longer than it will
> > > >   be truncated; otherwise, no truncation at all. Caution:
> > > >   classical ftp RFCs never define partial transfer and, in fact,
> > > >   some ftp servers out there do not handle client side forced
> > > >   close-down very well. Our implementation tries its best to
> > > >   handle such situations smoothly.</description>
> > > > </property>
> > > > <property>
> > > >   <name>http.timeout</name>
> > > >   <value>99999999999</value>
> > > >   <description>The default network timeout, in milliseconds.</description>
> > > > </property>
> > > >
> > > > What else can I do? Thanks.
> > > >
> > > > Chip

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
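[Editorial footnote on Markus's Integer.MAX_VALUE point: values like 99999999999 or 4294967290 do not fit in a Java int, and Integer.parseInt rejects out-of-range input. A config reader that follows the common Hadoop-style getInt pattern of catching NumberFormatException would then silently fall back to the default rather than use the huge value. A minimal sketch of that pattern (not Nutch code; the getInt helper here is hypothetical):]

```java
public class IntSettingSketch {

    // Hypothetical helper mimicking the Hadoop-style getInt pattern:
    // parse the string, and on failure return the default value.
    static int getInt(String valueString, int defaultValue) {
        try {
            return Integer.parseInt(valueString);
        } catch (NumberFormatException e) {
            // Out-of-range or malformed values land here.
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        // Fits in an int (max 2147483647): used as given.
        System.out.println(getInt("300000", 10000));      // 300000

        // 99999999999 exceeds Integer.MAX_VALUE: parse fails, default wins.
        System.out.println(getInt("99999999999", 10000)); // 10000
    }
}
```

[If this is what happens under the hood, an oversized http.timeout would quietly behave like the default, which could explain intermittent timeouts despite a "huge" setting; staying at or below 2147483647 is the safe ceiling.]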