Karl,

I don't know where you live, but if you come to Belgium, stop in Brussels for a good Belgian beer ;-)

In other words, raising the socket timeout from 900 to 2000 seconds has solved the problem: the job indexed about 160,000 documents in 2 hours.

On the other hand, the ManifoldCF/Solr machine (everything runs in the same Windows VM) has been allocated eight 3.6 GHz CPUs and 32 GB of memory, and it is used only for this indexing test, with no searches running on Solr. So the fact that a timeout of 900 seconds was not enough looks strange: is it possible that some of these 160,000 documents take more than 15 minutes to be handled by Solr?

Ronny & Frédéric
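For reference, the figures in this message can be sanity-checked with a short calculation. This is only a sketch: the constant and function names are illustrative, and the resulting rate is an average over the whole run across all connections, so it says nothing about how long any single document takes.

```python
# Figures quoted in the message above; all names here are illustrative.
DOCS_INDEXED = 160_000   # "about 160,000 documents"
ELAPSED_HOURS = 2        # "in 2 hours"
OLD_TIMEOUT_S = 900      # socket timeout (seconds) that was not enough
NEW_TIMEOUT_S = 2000     # socket timeout (seconds) that let the job finish


def docs_per_second(docs: int, hours: float) -> float:
    """Average indexing rate over the whole run."""
    return docs / (hours * 3600)


def seconds_to_minutes(seconds: float) -> float:
    """Convert a timeout in seconds to minutes."""
    return seconds / 60


rate = docs_per_second(DOCS_INDEXED, ELAPSED_HOURS)
old_minutes = seconds_to_minutes(OLD_TIMEOUT_S)
new_minutes = seconds_to_minutes(NEW_TIMEOUT_S)

print(f"average rate: {rate:.1f} documents/second")
print(f"old timeout:  {old_minutes:.1f} minutes")
print(f"new timeout:  {new_minutes:.1f} minutes")
```

So 900 seconds is exactly 15 minutes, 2000 seconds is a little over 33 minutes, and the run averaged roughly 22 documents per second. A document that needs more than 15 minutes would therefore be a rare outlier rather than the norm.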
On Thu, Nov 7, 2013 at 4:30 PM, Karl Wright <[email protected]> wrote:

> Hi Ronny,
>
> The failure is being caused because the time spent transferring data to
> Solr is exceeding the socket timeout you have set for the Solr connection,
> for some documents.
>
> This is probably due to excessive load on the Solr instance. My
> suggestion is to increase the socket timeout on your Solr connection to at
> least 30 minutes to see if this resolves the problem.
>
> Thanks,
> Karl
>
> On Thu, Nov 7, 2013 at 9:30 AM, Ronny Heylen <[email protected]> wrote:
>
>> Hi,
>> We have reset throttling to 10 for AD and SOLR (2 for the Windows
>> repository).
>> The job indexing all pptx to null output has run successfully (162733
>> documents).
>> The job indexing all pptx to Solr still fails; manifoldcf.log contains:
>> WARN 2013-11-07 14:34:06,502 (Worker thread '29') - JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy.
>> jcifs.smb.SmbException: All pipe instances are busy.
>>     at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
>>     at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
>>     at jcifs.smb.SmbSession.send(SmbSession.java:238)
>>     at jcifs.smb.SmbTree.send(SmbTree.java:119)
>>     at jcifs.smb.SmbFile.send(SmbFile.java:775)
>>     at jcifs.smb.SmbFile.open0(SmbFile.java:989)
>>     at jcifs.smb.SmbFile.open(SmbFile.java:1006)
>>     at jcifs.smb.SmbFileOutputStream.<init>(SmbFileOutputStream.java:142)
>>     at jcifs.smb.TransactNamedPipeOutputStream.<init>(TransactNamedPipeOutputStream.java:32)
>>     at jcifs.smb.SmbNamedPipe.getNamedPipeOutputStream(SmbNamedPipe.java:187)
>>     at jcifs.dcerpc.DcerpcPipeHandle.doSendFragment(DcerpcPipeHandle.java:68)
>>     at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:190)
>>     at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
>>     at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:140)
>>     at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2943)
>>     at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2393)
>>     at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.describeDocumentSecurity(SharedDriveConnector.java:1045)
>>     at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getDocumentVersions(SharedDriveConnector.java:554)
>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)
>> WARN 2013-11-07 14:55:45,257 (Worker thread '30') - IO exception during indexing: Read timed out
>> java.net.SocketTimeoutException: Read timed out
>>     at java.net.SocketInputStream.socketRead0(Native Method)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>>     at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>>     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>>     at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>>     at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>>     at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>>     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>>     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
>>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>>     at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:291)
>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>     at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:919)
>> WARN 2013-11-07 14:55:45,273 (Worker thread '30') - Service interruption reported for job 1383765534700 connection 'Filesharesrv1': IO exception during indexing: Read timed out
>> ERROR 2013-11-07 14:55:45,304 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure processing document: Read timed out
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Read timed out
>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:586)
>> Caused by: java.net.SocketTimeoutException: Read timed out
>>     at java.net.SocketInputStream.socketRead0(Native Method)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>>     at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>>     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>>     at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>>     at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>>     at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>>     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>>     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
>>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>>     at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:291)
>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>     at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:919)
>> WARN 2013-11-07 15:06:04,235 (Worker thread '9') - IO exception during indexing: Read timed out
>> java.net.SocketTimeoutException: Read timed out
>>     at java.net.SocketInputStream.socketRead0(Native Method)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>>     at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>>     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>>     at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>>     at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>>     at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>>     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>>     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
>>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>>     at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:291)
>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>     at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:919)
>> WARN 2013-11-07 15:06:04,235 (Worker thread '9') - Service interruption reported for job 1383765534700 connection 'Filesharesrv1': IO exception during indexing: Read timed out
>>
>> On Wed, Nov 6, 2013 at 9:28 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Ronny,
>>>
>>> One minor thing: you should need to set throttling to 2 ONLY for the
>>> Windows repository connection, not for AD or Solr.
>>>
>>> As for how to debug this issue, first off you should be looking in the
>>> manifoldcf.log file (or the equivalent). You should see WARN messages from
>>> the shared file connector under most conditions when there's a service
>>> interruption.
>>> You would probably see "Read timed out" warnings if you
>>> looked there, since that is what aborted the job run, along with a stack
>>> trace. However, that's not going to add much information to the analysis
>>> at this point.
>>>
>>> What might be valuable is to determine whether the problem is happening
>>> on the Windows side or on the Solr side. At this point I can't tell. You
>>> could, however, create a null output connection, and create a similar job
>>> that sends its output there, and see if it completes. Can you do this and
>>> get back to me?
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Wed, Nov 6, 2013 at 3:17 PM, Ronny Heylen <[email protected]> wrote:
>>>
>>>> Hi,
>>>> We use ManifoldCF 1.3 and Solr 4.4 to index a shared network drive with
>>>> several hundred thousand documents.
>>>> Running a single ManifoldCF job to index the whole drive always gave
>>>> some kind of error, so to better understand where the problem might
>>>> be, we made one job to index all *.doc*, another one for *.xls*, another
>>>> one for *.pdf ...
>>>> Using the help from the list (thanks!) we set the size limit to 100MB
>>>> and all jobs succeed (great) except the one for *.pptx.
>>>> The message is:
>>>> Error: Repeated service interruptions - failure processing document:
>>>> Read timed out
>>>> We don't find any error in the logs we have searched: solr.log, ...
>>>> Based on some indications found on the Internet, we have set the Throttling
>>>> max connections setting to 2 (instead of 10) in 3 places:
>>>>   output connection to SOLR
>>>>   authority connection to the Active Directory
>>>>   repository connection to the Windows file share
>>>> But the problem stays the same.
>>>> We have tried on another machine with SOLR 4.5 and ManifoldCF 1.4: same
>>>> problem.
>>>> We can let the job run for all *.PDF, or all *.DOC*, or all *.XLS*
>>>> without problem, but the same message always comes for *.PPTX.
>>>> The last time the job stopped with the message, it displayed (not the same
>>>> numbers for each run, as the Windows drive is changing) 56311 documents,
>>>> with 17466 busy and 38847 processed.
>>>> As we don't find anything in the log (but we are probably not looking in
>>>> the correct place), we don't know what to do.
>>>> Thanks for your help,
>>>> Ronny and Frédéric
