Hi Arcadius,

The socket timeout defaults to five minutes, which should be
sufficient even on a very slow internet connection. Even if a timeout
does occur once in a while, ManifoldCF will retry the connection for
some period of time before giving up entirely.

Instead of trying to set an even longer timeout, I'd first verify that
you can reach the web page from that machine through a browser.
Something else may be wrong, such as a firewall.
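
If no browser is available on the crawler machine, a rough check along
these lines could stand in for one. This is only a sketch: the class
name FetchCheck, the default URL, and the 10 s / 30 s timeouts are
placeholders made up for illustration, and it uses the JDK's
java.net.HttpURLConnection rather than anything from ManifoldCF itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.ConnectException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.UnknownHostException;

// Hypothetical helper, not part of ManifoldCF: fetch one URL from the
// crawler machine and report roughly where the request fails.
class FetchCheck {

    // Map a failure to a rough hint about where to look next.
    static String diagnose(IOException e) {
        if (e instanceof SocketTimeoutException) {
            return "timeout: connected or connecting, but no data arrived in time";
        }
        if (e instanceof UnknownHostException) {
            return "dns: host name could not be resolved";
        }
        if (e instanceof ConnectException) {
            return "connect: connection refused or blocked";
        }
        return "io: " + e;
    }

    // Fetch the URL with explicit timeouts and read the whole body,
    // since the crawler's timeout occurred while reading the stream.
    static String check(String url, int connectTimeoutMs, int readTimeoutMs) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectTimeoutMs);
            conn.setReadTimeout(readTimeoutMs);
            int status = conn.getResponseCode();
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            long bytes = 0;
            if (in != null) {
                try (InputStream body = in) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = body.read(buf)) != -1) {
                        bytes += n;
                    }
                }
            }
            return "ok: HTTP " + status + ", " + bytes + " bytes read";
        } catch (IOException e) {
            return diagnose(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; pass the real seed URL as the first argument.
        String url = args.length > 0 ? args[0] : "http://example.com/";
        System.out.println(check(url, 10_000, 30_000));
    }
}
```

Run it on the same machine and under the same account as the crawler.
A "timeout" result there as well would point at the network path or
the remote server rather than at the ManifoldCF settings.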

Karl

On Sat, Jul 21, 2012 at 4:00 PM, Arcadius Ahouansou
<[email protected]> wrote:
>
> Hello.
>
> I am running ManifoldCF 0.6 as a web crawler and indexing into Solr4.
>
> When I run against a local website running on my local machine, things work
> well.
>
> However, when I am crawling a different site, a remote one, I get the
> warnings below and nothing gets indexed.
>
> - Any idea what may be causing this?
>
> - I thought that this may be because of my slow network connection:
> Is there a way I could change the default timeout/readTimeout for HTTP
> connections in ManifoldCF?
>
>
> Thanks.
>
> Arcadius.
>
>
> ----
>  WARN 2012-07-21 19:04:55,602 (Worker thread '20') - Socket timeout exception reading socket stream: Read timed out
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(Unknown Source)
> at java.net.SocketInputStream.read(Unknown Source)
> at java.io.BufferedInputStream.read1(Unknown Source)
> at java.io.BufferedInputStream.read(Unknown Source)
> at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> at java.io.FilterInputStream.read(Unknown Source)
> at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2010)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1974)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
>  WARN 2012-07-21 19:05:10,867 (Worker thread '24') - Socket timeout exception reading socket stream: Read timed out
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(Unknown Source)
> at java.net.SocketInputStream.read(Unknown Source)
> at java.io.BufferedInputStream.read1(Unknown Source)
> at java.io.BufferedInputStream.read(Unknown Source)
> at org.apache.commons.httpclient.ChunkedInputStream.read(Unknown Source)
> at java.io.FilterInputStream.read(Unknown Source)
> at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2010)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1974)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
> at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
>  WARN 2012-07-21 19:09:55,612 (Worker thread '20') - Pre-ingest service interruption reported for job 1342882564711 connection 'MyRemoteWebConnector': Socket timeout: Read timed out
>  WARN 2012-07-21 19:10:10,876 (Worker thread '24') - Pre-ingest service interruption reported for job 1342882564711 connection 'MyRemoteWebConnector': Socket timeout: Read timed out
>
> ----
