Re: Timeout problems with web crawling

Karl Wright Tue, 23 Apr 2013 04:00:34 -0700

I take back the "no exceptions" comment.  We are getting one in the
testhost log:


 INFO 2013-04-22 17:39:39,387 (Worker thread '27') - WEB: FETCH
URL|http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G|1366644879398+299979|-1|0|java.net.SocketTimeoutException|
Read timed out
 WARN 2013-04-22 17:39:39,387 (Worker thread '27') - Pre-ingest
service interruption reported for job 1360671306324 connection
'web_crawler': Timed out waiting for IO for
'http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G': Read timed
out
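
The timing field in the FETCH record above, 1366644879398+299979, can be read as a start time in epoch milliseconds plus an elapsed time in milliseconds. That reading is an assumption based on the values themselves, not on the connector's documentation, and the remaining numeric fields are left uninterpreted. A minimal Python sketch under that assumption (the name decode_fetch_record is made up for illustration):

```python
from datetime import datetime, timedelta, timezone

# The FETCH record from the log above, with the mail line wraps removed.
FETCH_RECORD = (
    "URL|http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G|"
    "1366644879398+299979|-1|0|java.net.SocketTimeoutException|Read timed out"
)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_fetch_record(record):
    """Split a '|'-separated FETCH record and decode its timing field.

    Assumption: the third field is '<start epoch ms>+<elapsed ms>'.
    """
    fields = record.split("|")
    start_ms, elapsed_ms = (int(part) for part in fields[2].split("+"))
    return {
        "url": fields[1],
        "start_utc": EPOCH + timedelta(milliseconds=start_ms),
        "elapsed_seconds": elapsed_ms / 1000,
        "error": fields[5],
    }


info = decode_fetch_record(FETCH_RECORD)
print(info["start_utc"].isoformat(), info["elapsed_seconds"], info["error"])
```

Read this way, the fetch started at 15:34:39 UTC and ran for 299.979 seconds, just under five minutes, before the read timed out.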

It really does seem to be a socket timeout.  It looks like it was able
to establish a connection, but then waited 5 minutes for any data to
appear.  Can you fetch this URL without problem using the same headers
- esp. the User-Agent header?  It may be that your crawler is being
blocked by this site.
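
One way to run that check outside the crawler is a small script that sends the same User-Agent header with an explicit timeout. This is only a sketch using Python's standard library: the helper names build_request and fetch are made up for illustration, and the User-Agent string has to be replaced with whatever the crawler really sends.

```python
import time
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout_s=30.0):
    """Fetch url and return (outcome, elapsed_seconds).

    outcome is the HTTP status code when the server answers, or a short
    string describing the failure (for example a read timeout).
    """
    request = build_request(url, user_agent)
    started = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            response.read()  # read the whole body, not just the headers
            outcome = response.status
    except urllib.error.HTTPError as exc:
        outcome = exc.code  # the server answered, but with an error status
    except OSError as exc:  # covers URLError, socket timeouts, resets
        outcome = "failed: %s" % exc
    return outcome, time.monotonic() - started
```

Calling fetch("http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G", agent) once with the crawler's User-Agent and once with a browser's would show whether the site treats the two differently. If only the crawler's User-Agent times out, blocking is the likely cause.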

Karl




On Tue, Apr 23, 2013 at 6:50 AM, Karl Wright <[email protected]> wrote:

> The Solr indexing seems to be working fine on the test host.  I haven't
> verified that is true on the production host.  The cause of the production
> host hanging, though, may be the really awful stuffer query plan.  It seems
> to hang but in fact just gets very very slow.
>
> Can you dump the PostgreSQL schema that is in place on the production
> machine?  Specifically, I want to see the jobqueue table's indexes.
>
> I do not see any exceptions at all logged either place.  If there's a
> service interruption, usually a warning log entry is dumped.  Not seeing
> that though.
>
>
>
>
> On Tue, Apr 23, 2013 at 6:22 AM, Erlend Garåsen
> <[email protected]> wrote:
>
>>
>> I'm still having problems with web crawling using trunk with the updated
>> HttpClient. It seems that the problems occur when Solr is password protected,
>> even though the error messages in my logs indicate a timeout problem. I'm
>> not 100% sure, but the problem seems to start as soon as I enable
>> password protection.
>>
>> We have struggled a lot with the web crawler in production mode recently,
>> but I thought that we had managed to get around these problems when "expect 100
>> continue" was added to the header (now added in trunk). Then we discovered
>> a Resin bug which sent a wrong HTTP status code back when this header was
>> enabled, but this has been solved by moving the authentication
>> configuration to Apache HTTP server instead (using .htaccess). So
>> everything *should* work, but it doesn't. Now I have managed to reproduce
>> the problems on our test server as well, after adding full password
>> protection for the Solr test server. As I wrote above, the logs do not
>> seem to report problems with the Solr server, but with the crawled resources
>> instead.
>>
>> I have added two logs. One from the production server, and another from
>> the test server. Log level is set to DEBUG for HttpClient. The prod job
>> just stops and hangs, maybe due to a db lock. The test stops with the
>> message "Error: Repeated service interruptions - failure processing
>> document: null" ("read timed out" in simple history).
>>
>> The logs are available here:
>> http://folk.uio.no/erlendfg/manifoldcf/
>>
>> Erlend
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>> 31050
>>
>
>
