Hello. I am running ManifoldCF 0.6 as a web crawler and indexing into Solr4.
When I crawl a website running on my local machine, everything works well. However, when I crawl a different, remote site, I get the warnings below and nothing gets indexed.

- Any idea what may be causing this?
- I thought it might be because of my slow network connection. Is there a way to change the default timeout/readTimeout for HTTP connections in ManifoldCF?

Thanks.
Arcadius.

----
WARN 2012-07-21 19:04:55,602 (Worker thread '20') - Socket timeout exception reading socket stream: Read timed out
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2010)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1974)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
WARN 2012-07-21 19:05:10,867 (Worker thread '24') - Socket timeout exception reading socket stream: Read timed out
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at org.apache.commons.httpclient.ChunkedInputStream.read(Unknown Source)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2010)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1974)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
WARN 2012-07-21 19:09:55,612 (Worker thread '20') - Pre-ingest service interruption reported for job 1342882564711 connection 'MyRemoteWebConnector': Socket timeout: Read timed out
WARN 2012-07-21 19:10:10,876 (Worker thread '24') - Pre-ingest service interruption reported for job 1342882564711 connection 'MyRemoteWebConnector': Socket timeout: Read timed out
----
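
For context, my understanding is that this is the same exception a plain Java socket throws when its read timeout (SO_TIMEOUT) expires before any data arrives. Below is a minimal sketch of that behaviour, independent of ManifoldCF. The class name `ReadTimeoutDemo`, the method `readTimesOut`, and the 200 ms value are made up for illustration; this is not how the web connector itself sets its timeout.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {

    // Hypothetical helper: connects to a local server that never sends data
    // and reports whether the blocking read gave up after timeoutMs.
    static boolean readTimesOut(int timeoutMs) throws IOException {
        try (ServerSocket server = new ServerSocket(0); // any free local port
             Socket client = new Socket("localhost", server.getLocalPort())) {
            // SO_TIMEOUT: maximum time a single read() may block.
            client.setSoTimeout(timeoutMs);
            InputStream in = client.getInputStream();
            try {
                in.read(); // the server never writes, so this blocks
                return false;
            } catch (SocketTimeoutException e) {
                // Same exception type as in the log above.
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTimesOut(200) ? "read timed out" : "data received");
    }
}
```

If the remote site is simply slow to respond, a larger read timeout on the crawler side would presumably avoid the exception, which is why I am asking where that value can be configured.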
