On 23.04.13 13.00, Karl Wright wrote:
> I take back the "no exceptions" comment. We are getting one in the
> testhost log:
>
> INFO 2013-04-22 17:39:39,387 (Worker thread '27') - WEB: FETCH
> URL|http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G|1366644879398+299979|-1|0|java.net.SocketTimeoutException|
> Read timed out
> WARN 2013-04-22 17:39:39,387 (Worker thread '27') - Pre-ingest service
> interruption reported for job 1360671306324 connection 'web_crawler': Timed out
> waiting for IO for 'http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G':
> Read timed out
>
> It really does seem to be a socket timeout. It looks like it was able to
> establish a connection, but then waited 5 minutes for any data to appear. Can
> you fetch this URL without problem using the same headers - esp. the User-Agent
> header? It may be that your crawler is being blocked by this site.
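
The "5 minutes" figure can be read off the FETCH line itself. The following is a minimal Python sketch, assuming the pipe-separated fields are: activity type, URL, start time in epoch milliseconds plus elapsed milliseconds, result code, byte count, exception class, message. That layout is inferred from this one log line, not taken from the ManifoldCF documentation, and parse_fetch_line is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_fetch_line(payload: str) -> dict:
    """Split a 'WEB: FETCH' payload into named fields.

    Assumed layout (inferred from the log line in this thread):
    kind|url|start_ms+elapsed_ms|result_code|byte_count|exception|message
    """
    fields = [f.strip() for f in payload.split("|")]
    start_ms, elapsed_ms = (int(x) for x in fields[2].split("+"))
    return {
        "url": fields[1],
        "start": datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc),
        "elapsed_s": elapsed_ms / 1000,
        "result_code": int(fields[3]),
        "byte_count": int(fields[4]),
        "exception": fields[5],
    }


# The payload from the testhost log, with the wrapped lines joined.
payload = (
    "URL|http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G"
    "|1366644879398+299979|-1|0|java.net.SocketTimeoutException|Read timed out"
)

info = parse_fetch_line(payload)
print(info["url"])        # the URL only, without the timing fields
print(info["start"])      # when the fetch started (UTC)
print(info["elapsed_s"])  # seconds spent before the read timed out
```

Under that reading the fetch ran for just under 300 seconds and returned 0 bytes, which fits a read timeout after the connection was established. It also shows that the timing fields are separate from the URL.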
-bash-3.2$ curl -vvv \
  -H "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])" \
  "http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G|1366644879398+299979"
* About to connect() to www.ibsen.uio.no port 80
* Trying 129.240.7.27... connected
* Connected to www.ibsen.uio.no (129.240.7.27) port 80
> GET /REGINFO_peAGa.xhtml?bokstav=G|1366644879398+299979 HTTP/1.1
> Host: www.ibsen.uio.no
> Accept: */*
> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])
>
< HTTP/1.1 200 OK
< Date: Tue, 23 Apr 2013 11:45:10 GMT
< Server: Apache-Coyote/1.1
< X-Cocoon-Version: 2.1.12-dev
< Content-Type: text/html
< Transfer-Encoding: chunked
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
...
This curl command was run on the same test server, and it seems to work
as it should.
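
Note that the request line in the curl output still carries the "|1366644879398+299979" fields from the log, so it is not byte-for-byte the URL in the warning. A rough Python sketch of the same check against the URL exactly as it appears in the WARN line, with an explicit read timeout, might look like the following. The 300-second timeout is an assumption chosen to match the roughly 5-minute wait in the log, and build_request and fetch are hypothetical helper names; the User-Agent string has to be supplied by the caller.

```python
import socket
import urllib.error
import urllib.request

# URL as quoted in the WARN line, without the log timing fields.
URL = "http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request carrying the crawler's User-Agent header."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent, "Accept": "*/*"}
    )


def fetch(url: str, user_agent: str, timeout_s: float = 300.0) -> str:
    """Fetch the URL and report the status, a timeout, or another failure."""
    req = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            body = resp.read()
            return f"HTTP {resp.status}, {len(body)} bytes"
    except socket.timeout:
        # Raised when no data arrives within timeout_s while reading.
        return "read timed out"
    except urllib.error.URLError as exc:
        # Covers connection failures and HTTP error statuses.
        return f"failed: {exc.reason}"


# Usage (performs a real network request, so it is left commented out):
# print(fetch(URL, "the exact User-Agent string the crawler sends"))
```

If this also returns the page promptly, the User-Agent alone is unlikely to explain the timeout.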
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050