Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Chip Calhoun
for timeout? Good to know! I was definitely exceeding that, so I've changed my properties. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, October 20, 2011 10:00 AM To: user@nutch.apache.org Cc: Chip Calhoun Subject: Re: Good workaround

Re: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Markus Jelsma
The actual parse which is producing timeouts happens early in the process. To my knowledge, there are no Nutch settings to make this faster or change its behaviour; it's all about the parser implementation. Try increasing your parser.timeout setting. On Wednesday 26 October 2011 16:45:33
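
As a rough illustration of the advice above, raising the limit might look like the fragment below. This is a minimal sketch, assuming a Nutch 1.x install where local overrides live in conf/nutch-site.xml and where parser.timeout is read in seconds; the value 120 and the description text are illustrative assumptions, not settings taken from this thread.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides for values in nutch-default.xml -->
<configuration>
  <property>
    <name>parser.timeout</name>
    <!-- Assumed to be seconds per document parse; 120 is an arbitrary example value -->
    <value>120</value>
    <description>Raised so that very large XML files have time to finish parsing.</description>
  </property>
</configuration>
```

After changing the value, re-running the same ParserChecker test against the problem page should show whether the parse now completes within the new limit.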

RE: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Chip Calhoun
parsing of large XML files (Was RE: Good workaround for timeout?) The actual parse which is producing timeouts happens early in the process. To my knowledge, there are no Nutch settings to make this faster or change its behaviour; it's all about the parser implementation. Try increasing your

RE: Good workaround for timeout?

2011-10-20 Thread Chip Calhoun
, 2011 4:57 PM To: user@nutch.apache.org Cc: Chip Calhoun Subject: Re: Good workaround for timeout? I'm using protocol-http, but I removed protocol-httpclient after you pointed out in another thread that it's broken. Unfortunately I'm not sure which properties are used by what, and I'm not sure

Re: Good workaround for timeout?

2011-10-20 Thread Markus Jelsma
Integer.MAX_VALUE. Don't know for sure how Hadoop will handle it. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Wednesday, October 19, 2011 4:57 PM To: user@nutch.apache.org Cc: Chip Calhoun Subject: Re: Good workaround for timeout? I'm using protocol

RE: Good workaround for timeout?

2011-10-20 Thread Chip Calhoun
Good to know! I was definitely exceeding that, so I've changed my properties. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, October 20, 2011 10:00 AM To: user@nutch.apache.org Cc: Chip Calhoun Subject: Re: Good workaround for timeout

Re: Good workaround for timeout?

2011-10-19 Thread Markus Jelsma
What is timing out, the fetch or the parse? I'm getting a fairly persistent timeout on a particular page. Other, smaller pages in this folder do fine, but this one times out most of the time. When it fails, my ParserChecker results look like: # bin/nutch

RE: Good workaround for timeout?

2011-10-19 Thread Chip Calhoun
: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Wednesday, October 19, 2011 11:08 AM To: user@nutch.apache.org Subject: Re: Good workaround for timeout? What is timing out, the fetch or the parse? I'm getting a fairly persistent timeout on a particular page. Other, smaller pages

Re: Good workaround for timeout?

2011-10-19 Thread Markus Jelsma
Subject: Re: Good workaround for timeout? What is timing out, the fetch or the parse? I'm getting a fairly persistent timeout on a particular page. Other, smaller pages in this folder do fine, but this one times out most of the time. When it fails, my ParserChecker results look like

RE: Good workaround for timeout?

2011-10-19 Thread Chip Calhoun
because of a very long or corrupted document. </description> </property> -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Wednesday, October 19, 2011 11:28 AM To: user@nutch.apache.org Subject: Re: Good workaround for timeout? It is indeed. Tricky. Are you going