Timeout problems with web crawling

Erlend Garåsen Tue, 23 Apr 2013 03:23:28 -0700

I'm still having problems with web crawling using trunk with the updated HTTP client. It seems that the problems occur when Solr is password protected, even though the error messages in my logs indicate a timeout problem. I'm not 100 % sure, but it seems that the problem starts as soon as I enable password protection.

We have struggled a lot with the web crawler in production mode recently, but I thought that we had managed to get around these problems when the "Expect: 100-continue" header was added to the request (now added in trunk). Then we discovered a Resin bug which sent the wrong HTTP status code back when this header was enabled, but this has been solved by moving the authentication configuration to Apache HTTP Server instead (using .htaccess). So everything *should* work, but it doesn't. Now I have managed to reproduce the problems on our test server as well, after I added full password protection for the Solr test server. As I wrote above, the logs do not seem to report problems with the Solr server, but with the crawled resources instead.

I have added two logs. One from the production server, and another from the test server. Log level is set to DEBUG for HttpClient. The prod job just stops and hangs, maybe due to a db lock. The test stops with the message "Error: Repeated service interruptions - failure processing document: null" ("read timed out" in simple history).


The logs are available here:
http://folk.uio.no/erlendfg/manifoldcf/

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
