On 23.04.13 13.00, Karl Wright wrote:
I take back the "no exceptions" comment.  We are getting one in the
testhost log:

  INFO 2013-04-22 17:39:39,387 (Worker thread '27') - WEB: FETCH URL|http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G|1366644879398+299979|-1|0|java.net.SocketTimeoutException|Read timed out
  WARN 2013-04-22 17:39:39,387 (Worker thread '27') - Pre-ingest service interruption reported for job 1360671306324 connection 'web_crawler': Timed out waiting for IO for 'http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G': Read timed out

It really does seem to be a socket timeout.  It looks like it was able to 
establish a connection, but then waited 5 minutes for any data to appear.  Can 
you fetch this URL without problem using the same headers - esp. the User-Agent 
header?  It may be that your crawler is being blocked by this site.
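
For anyone reading along, here is a rough sketch of how the FETCH entry above can be picked apart to check the "5 minutes" figure. It assumes the entry is pipe-delimited and that the third field is a start time in epoch milliseconds followed by "+" and the elapsed milliseconds; that layout is inferred from the log line itself, not taken from the ManifoldCF documentation, and parse_fetch_entry is just a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_fetch_entry(entry):
    """Split a pipe-delimited 'WEB: FETCH' entry into named fields.

    Assumed layout (inferred from the log line, not from any documentation):
    kind|url|start_ms+elapsed_ms|code|bytes|exception|message
    """
    fields = entry.split("|")
    start_ms, elapsed_ms = fields[2].split("+")
    return {
        "url": fields[1],
        "start_utc": datetime.fromtimestamp(int(start_ms) / 1000, tz=timezone.utc),
        "elapsed_s": int(elapsed_ms) / 1000,
        "exception": fields[5],
    }


entry = (
    "URL|http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G"
    "|1366644879398+299979|-1|0|java.net.SocketTimeoutException|Read timed out"
)
info = parse_fetch_entry(entry)
print(info["url"])
print(info["start_utc"].isoformat(), info["elapsed_s"], info["exception"])
```

Under that assumption the elapsed time is just under 300 seconds, which matches a five-minute read timeout, and the URL field ends at the first "|", so the "1366644879398+299979" part belongs to the log entry rather than to the URL.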

-bash-3.2$ curl -vvv -H "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])" "http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G|1366644879398+299979"
* About to connect() to www.ibsen.uio.no port 80
*   Trying 129.240.7.27... connected
* Connected to www.ibsen.uio.no (129.240.7.27) port 80
> GET /REGINFO_peAGa.xhtml?bokstav=G|1366644879398+299979 HTTP/1.1
> Host: www.ibsen.uio.no
> Accept: */*
> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])
>
< HTTP/1.1 200 OK
< Date: Tue, 23 Apr 2013 11:45:10 GMT
< Server: Apache-Coyote/1.1
< X-Cocoon-Version: 2.1.12-dev
< Content-Type: text/html
< Transfer-Encoding: chunked
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
...

I ran this curl command on the same test server, and it seems to work as it should.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
