Hi Arcadius,

The socket timeout defaults to five minutes. That should be sufficient even on a very slow internet connection. And even if a timeout does occur once in a while, ManifoldCF will retry the connection for some period of time before giving up entirely.
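
For reference, "Read timed out" means the connection itself was established but no data arrived on it within the socket's read timeout. Below is a minimal sketch of that condition using only plain java.net. It is not ManifoldCF's code; the class and method names (ReadTimeoutDemo, readTimesOut) are made up for illustration, and the 200 ms timeout is just a short stand-in for the five-minute default.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of ManifoldCF.
public class ReadTimeoutDemo {

    /**
     * Connects to a local server socket that never sends any data and
     * reports whether the read gave up with a SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMillis) {
        // The listening socket never calls accept() or writes anything; the OS
        // still completes the TCP handshake, so the client connects fine.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            // Read timeout: the longest a single read() may block waiting for data.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read();      // blocks, because the peer stays silent
                return false;   // data or end-of-stream arrived before the timeout
            } catch (SocketTimeoutException e) {
                return true;    // same exception type as in the crawler log
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

The point of the sketch is that the timeout only measures how long the remote side stays silent, so a server or firewall that accepts the connection and then never answers produces the same exception no matter how large the limit is.
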
Instead of trying to set an even longer timeout, I'd try to verify that you can connect to the web page from that machine through a browser. It may be that something else is wrong, such as a firewall.

Karl

On Sat, Jul 21, 2012 at 4:00 PM, Arcadius Ahouansou <[email protected]> wrote:
>
> Hello.
>
> I am running ManifoldCF 0.6 as a web crawler and indexing into Solr4.
>
> When I run against a local website running on my local machine, things work
> well.
>
> However, when I am crawling a different site, a remote one, I get the
> warning below and nothing gets indexed.
>
> - Any idea about what may be causing this?
>
> - I thought that this may be because of my slow network connection:
> Is there a way I could change the default timeout/readTimeout for HTTP
> connection in ManifoldCF?
>
>
> Thanks.
>
> Arcadius.
>
>
> ----
> WARN 2012-07-21 19:04:55,602 (Worker thread '20') - Socket timeout exception reading socket stream: Read timed out
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(Unknown Source)
> at java.net.SocketInputStream.read(Unknown Source)
> at java.io.BufferedInputStream.read1(Unknown Source)
> at java.io.BufferedInputStream.read(Unknown Source)
> at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> at java.io.FilterInputStream.read(Unknown Source)
> at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2010)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1974)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
> WARN 2012-07-21 19:05:10,867 (Worker thread '24') - Socket timeout exception reading socket stream: Read timed out
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(Unknown Source)
> at java.net.SocketInputStream.read(Unknown Source)
> at java.io.BufferedInputStream.read1(Unknown Source)
> at java.io.BufferedInputStream.read(Unknown Source)
> at org.apache.commons.httpclient.ChunkedInputStream.read(Unknown Source)
> at java.io.FilterInputStream.read(Unknown Source)
> at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2010)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1974)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
> WARN 2012-07-21 19:09:55,612 (Worker thread '20') - Pre-ingest service interruption reported for job 1342882564711 connection 'MyRemoteWebConnector': Socket timeout: Read timed out
> WARN 2012-07-21 19:10:10,876 (Worker thread '24') - Pre-ingest service interruption reported for job 1342882564711 connection 'MyRemoteWebConnector': Socket timeout: Read timed out
>
> ----
