I take back the "no exceptions" comment. We are getting one in the testhost log:
INFO 2013-04-22 17:39:39,387 (Worker thread '27') - WEB: FETCH URL|http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G|1366644879398+299979|-1|0|java.net.SocketTimeoutException| Read timed out
WARN 2013-04-22 17:39:39,387 (Worker thread '27') - Pre-ingest service interruption reported for job 1360671306324 connection 'web_crawler': Timed out waiting for IO for 'http://www.ibsen.uio.no/REGINFO_peAGa.xhtml?bokstav=G': Read timed out

It really does seem to be a socket timeout. It looks like it was able to establish a connection, but then waited 5 minutes for any data to appear.

Can you fetch this URL without problem using the same headers - esp. the User-Agent header? It may be that your crawler is being blocked by this site.

Karl

On Tue, Apr 23, 2013 at 6:50 AM, Karl Wright <[email protected]> wrote:

> The solr indexing seems to be working fine on the test host. I haven't
> verified that is true on the production host. The cause of the production
> host hanging, though, may be the really awful stuffer query plan. It seems
> to hang but in fact just gets very very slow.
>
> Can you dump the postgresql schema that is in place on the production
> machine? Specifically, I want to see the jobqueue table's indexes.
>
> I do not see any exceptions at all logged either place. If there's a
> service interruption, usually a warning log entry is dumped. Not seeing
> that though.
>
> On Tue, Apr 23, 2013 at 6:22 AM, Erlend Garåsen
> <[email protected]> wrote:
>
>> I'm still having problems with web crawling using trunk with updated Http
>> client. It seems that the problems occur when Solr is password protected
>> even though the error messages in my logs indicate a timeout problem. I'm
>> not 100 % sure, but it seems that the problem starts as soon as I'm
>> enabling password protection.
>>
>> We have struggled a lot with the web crawler in production mode recently,
>> but I thought that we managed to get around these problems when "expect 100
>> continue" was added to the header (now added in trunk). Then we discovered
>> a Resin bug which sent a wrong http status code back when this header was
>> enabled, but this has been solved by moving the authentication
>> configuration to Apache HTTP server instead (using .htaccess). So
>> everything *should* work, but it doesn't. Now I have managed to reproduce
>> the problems on our test server as well when I added full password
>> protection for the Solr test server. As I wrote above, the logs do not
>> seem to report problems with the Solr server, but with the crawled
>> resources instead.
>>
>> I have added two logs. One from the production server, and another from
>> the test server. Log level is set to DEBUG for HttpClient. The prod job
>> just stops and hangs, maybe due to a db lock. The test stops with the
>> message "Error: Repeated service interruptions - failure processing
>> document: null" ("read timed out" in simple history).
>>
>> The logs are available here:
>> http://folk.uio.no/erlendfg/manifoldcf/
>>
>> Erlend
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>> 31050

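One way to check the User-Agent question outside the crawler is a small standalone fetch. This is only a sketch: it uses plain java.net.HttpURLConnection rather than the HttpClient library discussed in this thread, and the class name FetchCheck, the prepare helper and the 30-second timeout are made up for illustration. Pass the URL and the exact User-Agent string your crawler sends as arguments.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URI;

public class FetchCheck {
    // Hypothetical helper: configures a connection with an explicit User-Agent
    // and timeouts. Nothing is sent over the network until the caller asks
    // for the response.
    public static HttpURLConnection prepare(String url, String userAgent, int timeoutMs)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setConnectTimeout(timeoutMs); // limit on establishing the connection
        conn.setReadTimeout(timeoutMs);    // limit on waiting for data once connected
        return conn;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java FetchCheck <url> <user-agent>");
            return;
        }
        // 30 seconds is an arbitrary choice for this sketch.
        HttpURLConnection conn = prepare(args[0], args[1], 30_000);
        try {
            // getResponseCode() performs the actual request.
            System.out.println("HTTP status: " + conn.getResponseCode());
        } catch (SocketTimeoutException e) {
            System.out.println("Timed out: " + e.getMessage());
        } finally {
            conn.disconnect();
        }
    }
}
```

If the fetch times out with the crawler's User-Agent but succeeds with a browser's, that would point towards the site blocking the crawler rather than a problem on the Solr side.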