Re: SharePoint: Error closing connection to file
Karl Wright wrote: I just did a check-in which should fix the NPE.

Hi Karl,

Not fully tested, but I think this commit fixed the issue. I ran a few crawls without problems. Thanks for it.

I also posted this on the Solr user mailing list: http://search-lucene.com/m/IeWzIc11mS (about the weird icon that pops up). I attached images too.

Ahmet
Re: SharePoint: Error closing connection to file
Hi Karl,

Somehow those scanned PDF files do not throw an exception. I tried sending them using curl:

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@ticaret_sicil_gazetesi.pdf"

No exception in the Solr logs. The file is indexed. But when I do this, the Java coffee icon appears in the Dock. I don't know what this is. I will investigate further on the Tika/Solr side. Thanks for your support on this.

Anyway, I still sometimes get:

    Got an unknown remote exception accessing site - axis fault = Server.userException, detail = java.net.UnknownHostException: null

I see the following entries in manifoldcf.log:

    WARN 2012-08-14 17:39:41,099 (Thread-10418) - Cookie rejected: $Version=0; http%3A%2F%2Fiknowtest%2FDiscovery=WorkspaceSiteName=SUtOb3c=WorkspaceSiteUrl=aHR0cDovL2lrbm93dGVzdA==WorkspaceSiteTime=MjAxMi0wOC0xNFQxNDozOTo0MQ==; $Path=/_vti_bin/Discovery.asmx. Illegal path attribute /_vti_bin/Discovery.asmx. Path of origin: /Pages/denemeIkGeneralPage0712-6740.aspx
    FATAL 2012-08-14 17:55:55,096 (Startup thread) - Error tossed: null
    java.lang.NullPointerException
        at org.apache.manifoldcf.crawler.interfaces.QueueTracker$PriorityKey.hashCode(QueueTracker.java:726)
        at java.util.HashMap.get(HashMap.java:300)
        at org.apache.manifoldcf.crawler.interfaces.QueueTracker.calculatePriority(QueueTracker.java:518)
        at org.apache.manifoldcf.crawler.system.SeedingActivity.writeSeedDocuments(SeedingActivity.java:225)
        at org.apache.manifoldcf.crawler.system.SeedingActivity.doneSeeding(SeedingActivity.java:165)
        at org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:181)

--- On Tue, 8/14/12, Karl Wright daddy...@gmail.com wrote:

From: Karl Wright daddy...@gmail.com
Subject: Re: SharePoint: Error closing connection to file
To: dev@manifoldcf.apache.org
Date: Tuesday, August 14, 2012, 9:32 AM

I've committed a fix to how the WorkerThread handles service interruptions. This should eliminate the unexpected value exception. Could you confirm that it does?

After that, I believe you will have to look at your Tika setup on Solr to figure out how to avoid having PDFs blow up the pipeline. You should confirm first that Tika is indeed throwing an exception when a PDF is sent to it, of course, and that Solr is closing the HTTP connection under those conditions.

Thanks,
Karl

On Tue, Aug 14, 2012 at 1:28 AM, Karl Wright daddy...@gmail.com wrote:

There are two different issues here. The first one is that you are having a connection close on you; not sure of the reason why, but it could potentially be caused by a Tika exception in Solr. The second is that the refactored WorkerThread code I checked in on Sunday might have a bug in handling exceptions of this kind. I'll have a look at these and get back to you shortly.

Karl

On Mon, Aug 13, 2012 at 10:28 PM, Ahmet Arslan iori...@yahoo.com wrote:

If I modify my Path Rules to index only *.doc and *.docx files, I can re-index over and over without restarting anything. Everything works fine. It seems that there is a problem with files from which text cannot be extracted.

    /Documents/*.doc     file    include
    /Documents/*.docx    file    include

--- On Tue, 8/14/12, Ahmet Arslan iori...@yahoo.com wrote:

From: Ahmet Arslan iori...@yahoo.com
Subject: Re: SharePoint: Error closing connection to file
To: dev@manifoldcf.apache.org
Date: Tuesday, August 14, 2012, 5:20 AM

Also, after this, when I hit View Repository Connection Status, I get:

    Got an unknown remote exception accessing site - axis fault = Server.userException, detail = java.net.UnknownHostException: null

After I restart MCF, I get "Connection status: Connection working" on the View Repository Connection Status page.

--- On Tue, 8/14/12, Ahmet Arslan iori...@yahoo.com wrote:

From: Ahmet Arslan iori...@yahoo.com
Subject: SharePoint: Error closing connection to file
To: dev@manifoldcf.apache.org
Date: Tuesday, August 14, 2012, 5:18 AM

Hello,

Using the Solr output connector and the SP2010 repository connector, I am indexing a document library named Documents. This library has some scanned PDF documents. The very first crawl indexes all 91 docs. When I hit "Re-ingest all associated documents" and start a second crawl, I get:

    Error: Unexpected jobqueue status - record id 1344907007021, expecting active status, saw 3

When I look at http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf, it is an image (scanned) PDF. Here is the stack trace:

    WARN 2012-08-14 05:13:22,068 (Worker thread '39') - SharePoint: Error closing connection to file 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf': Connection reset
    java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:113
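The "Cookie rejected" warning in the message above is easier to read once its encoded parts are decoded. The sketch below is only an illustration, not ManifoldCF or HttpClient code. It assumes the cookie value consists of three name=value fields whose separators were lost in the log, and it models the rejection as a plain prefix test on the path, which is consistent with the "Illegal path attribute ... Path of origin" wording but is not confirmed anywhere in the thread. The names `decode_fields` and `path_matches` are hypothetical.

```python
import base64
from urllib.parse import unquote

# Cookie name and field values copied from the "Cookie rejected" log line.
COOKIE_NAME = "http%3A%2F%2Fiknowtest%2FDiscovery"
COOKIE_FIELDS = {
    "WorkspaceSiteName": "SUtOb3c=",
    "WorkspaceSiteUrl": "aHR0cDovL2lrbm93dGVzdA==",
    "WorkspaceSiteTime": "MjAxMi0wOC0xNFQxNDozOTo0MQ==",
}


def decode_fields(fields):
    """Base64-decode every field value to text (assumes UTF-8 content)."""
    return {
        name: base64.b64decode(value).decode("utf-8")
        for name, value in fields.items()
    }


def path_matches(cookie_path, request_path):
    """Simplified check: the cookie path must be a prefix of the request path."""
    return request_path.startswith(cookie_path)


if __name__ == "__main__":
    print("cookie name:", unquote(COOKIE_NAME))
    for name, text in decode_fields(COOKIE_FIELDS).items():
        print(name, "=", text)
    accepted = path_matches(
        "/_vti_bin/Discovery.asmx",
        "/Pages/denemeIkGeneralPage0712-6740.aspx",
    )
    print("path accepted:", accepted)
```

Under these assumptions the cookie carries only the site name, site URL, and a timestamp, and the path test fails because the page that triggered the request lives under /Pages/ rather than /_vti_bin/.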
Re: SharePoint: Error closing connection to file
I just did a check-in which should fix the NPE. The other exception is a warning; the crawler should retry the document when that happens, so I would not get excited unless the job aborts.

Karl

On Tue, Aug 14, 2012 at 5:08 PM, Ahmet Arslan iori...@yahoo.com wrote:

Hi Karl,

Somehow those scanned PDF files do not throw an exception. I tried sending them using curl:

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@ticaret_sicil_gazetesi.pdf"

No exception in the Solr logs. The file is indexed. But when I do this, the Java coffee icon appears in the Dock. I don't know what this is. I will investigate further on the Tika/Solr side. Thanks for your support on this.

Anyway, I still sometimes get:

    Got an unknown remote exception accessing site - axis fault = Server.userException, detail = java.net.UnknownHostException: null

I see the following entries in manifoldcf.log:

    WARN 2012-08-14 17:39:41,099 (Thread-10418) - Cookie rejected: $Version=0; http%3A%2F%2Fiknowtest%2FDiscovery=WorkspaceSiteName=SUtOb3c=WorkspaceSiteUrl=aHR0cDovL2lrbm93dGVzdA==WorkspaceSiteTime=MjAxMi0wOC0xNFQxNDozOTo0MQ==; $Path=/_vti_bin/Discovery.asmx. Illegal path attribute /_vti_bin/Discovery.asmx. Path of origin: /Pages/denemeIkGeneralPage0712-6740.aspx
    FATAL 2012-08-14 17:55:55,096 (Startup thread) - Error tossed: null
    java.lang.NullPointerException
        at org.apache.manifoldcf.crawler.interfaces.QueueTracker$PriorityKey.hashCode(QueueTracker.java:726)
        at java.util.HashMap.get(HashMap.java:300)
        at org.apache.manifoldcf.crawler.interfaces.QueueTracker.calculatePriority(QueueTracker.java:518)
        at org.apache.manifoldcf.crawler.system.SeedingActivity.writeSeedDocuments(SeedingActivity.java:225)
        at org.apache.manifoldcf.crawler.system.SeedingActivity.doneSeeding(SeedingActivity.java:165)
        at org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:181)
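Karl's remark that the crawler should retry a document after a warning, and that only an aborted job is a real problem, can be pictured with a small retry loop. This is a generic sketch, not ManifoldCF's implementation: the thread does not show how ManifoldCF schedules retries, and `ServiceInterruption`, `process_with_retry`, and `make_flaky_fetch` are hypothetical names used only for illustration.

```python
import time


class ServiceInterruption(Exception):
    """Hypothetical stand-in for a transient failure such as 'Connection reset'."""


def process_with_retry(fetch, max_attempts=3, delay_seconds=0.0):
    """Call fetch() until it succeeds or max_attempts is exhausted.

    Returns (result, attempts_used). If every attempt fails, the last
    ServiceInterruption propagates, which plays the role of the job aborting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(), attempt
        except ServiceInterruption:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            if delay_seconds > 0:
                time.sleep(delay_seconds)  # wait before trying again


def make_flaky_fetch(failures):
    """Build a fetch function that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def fetch():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ServiceInterruption("Connection reset")
        return "document body"

    return fetch


if __name__ == "__main__":
    print(process_with_retry(make_flaky_fetch(2)))
```

In this picture a failed attempt that is followed by a successful retry corresponds to a WARN entry in the log, while running out of attempts corresponds to the job aborting.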
SharePoint: Error closing connection to file
Hello,

Using the Solr output connector and the SP2010 repository connector, I am indexing a document library named Documents. This library has some scanned PDF documents. The very first crawl indexes all 91 docs. When I hit "Re-ingest all associated documents" and start a second crawl, I get:

    Error: Unexpected jobqueue status - record id 1344907007021, expecting active status, saw 3

When I look at http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf, it is an image (scanned) PDF. Here is the stack trace:

    WARN 2012-08-14 05:13:22,068 (Worker thread '39') - SharePoint: Error closing connection to file 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf': Connection reset
    java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:113)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
        at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
        at org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown Source)
        at org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown Source)
        at java.io.FilterInputStream.close(FilterInputStream.java:155)
        at org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown Source)
        at org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown Source)
        at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1457)
        at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
    DEBUG 2012-08-14 05:13:22,072 (Worker thread '42') - SharePoint: Path attribute name is null
    WARN 2012-08-14 05:13:22,081 (Worker thread '39') - SharePoint: IOException thrown: Connection reset
    java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
        at java.io.FilterInputStream.read(FilterInputStream.java:90)
        at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
        at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1447)
        at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
    WARN 2012-08-14 05:13:22,186 (Worker thread '39') - Service interruption reported for job 1344906886879 connection 'SP2010': SharePoint is down attempting to read 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf', retrying: Connection reset
    ERROR 2012-08-14 05:13:22,230 (Worker thread '39') - Exception tossed: Unexpected jobqueue status - record id 1344907007021, expecting active status, saw 3
    org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected jobqueue status - record id 1344907007021, expecting active status, saw 3
        at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:711)
        at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:2435)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:745)
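A log excerpt like the one above is easier to follow one entry at a time, especially to see that the same worker thread logs the service interruption and, a few milliseconds later, the jobqueue error. The sketch below splits pasted log text into entries. It assumes every entry starts with a level, a timestamp, and a thread name in parentheses, as in this excerpt; `ENTRY_START` and `split_entries` are hypothetical names, and the pattern is not derived from ManifoldCF's logging configuration.

```python
import re

# Start of an entry: LEVEL, "YYYY-MM-DD HH:MM:SS,mmm", "(thread name)", " - ".
ENTRY_START = re.compile(
    r"(?P<level>DEBUG|INFO|WARN|ERROR|FATAL) "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\((?P<thread>[^)]*)\) - "
)


def split_entries(text):
    """Split log text into dicts with level, timestamp, thread, and message."""
    matches = list(ENTRY_START.finditer(text))
    entries = []
    for index, match in enumerate(matches):
        # An entry's message runs until the next entry starts (or the text ends).
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(text)
        entries.append(
            {
                "level": match.group("level"),
                "timestamp": match.group("timestamp"),
                "thread": match.group("thread"),
                "message": text[match.end():end].strip(),
            }
        )
    return entries


# Two entries taken from the excerpt, joined on one line as pasted text often is.
SAMPLE = (
    "WARN 2012-08-14 05:13:22,186 (Worker thread '39') - Service interruption "
    "reported for job 1344906886879 connection 'SP2010': SharePoint is down "
    "attempting to read 'http://iknowtest/Documents/ik_docs/vize_evraklari/"
    "ticaret_sicil_gazetesi.pdf', retrying: Connection reset "
    "ERROR 2012-08-14 05:13:22,230 (Worker thread '39') - Exception tossed: "
    "Unexpected jobqueue status - record id 1344907007021, expecting active "
    "status, saw 3"
)

if __name__ == "__main__":
    for entry in split_entries(SAMPLE):
        print(entry["timestamp"], entry["level"], entry["thread"])
        print("   ", entry["message"])
```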
Re: SharePoint: Error closing connection to file
Also, after this, when I hit View Repository Connection Status, I get:

    Got an unknown remote exception accessing site - axis fault = Server.userException, detail = java.net.UnknownHostException: null

After I restart MCF, I get "Connection status: Connection working" on the View Repository Connection Status page.
Re: SharePoint: Error closing connection to file
There are two different issues here. The first one is that you are having a connection close on you; not sure of the reason why, but it could potentially be caused by a Tika exception in Solr. The second is that the refactored WorkerThread code I checked in on Sunday might have a bug in handling exceptions of this kind. I'll have a look at these and get back to you shortly.

Karl

On Mon, Aug 13, 2012 at 10:28 PM, Ahmet Arslan iori...@yahoo.com wrote:

If I modify my Path Rules to index only *.doc and *.docx files, I can re-index over and over without restarting anything. Everything works fine. It seems that there is a problem with files from which text cannot be extracted.

    /Documents/*.doc     file    include
    /Documents/*.docx    file    include
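The effect of the two Path Rules quoted in this thread (include only *.doc and *.docx files under /Documents) can be sketched with Python's `fnmatch` module. This is an approximation, not the SharePoint connector's matching code: it assumes the first matching rule wins and that unmatched paths are excluded, and `fnmatch`'s `*` also matches `/`, which may differ from how ManifoldCF treats wildcards. `PATH_RULES` and `is_included` are hypothetical names.

```python
from fnmatch import fnmatchcase

# (path pattern, type, action) triples mirroring the two rules in the thread.
PATH_RULES = [
    ("/Documents/*.doc", "file", "include"),
    ("/Documents/*.docx", "file", "include"),
]


def is_included(path, rules=PATH_RULES):
    """Return True if the first matching file rule says 'include'.

    Paths that match no rule are treated as excluded (an assumption).
    """
    for pattern, kind, action in rules:
        if kind == "file" and fnmatchcase(path, pattern):
            return action == "include"
    return False


if __name__ == "__main__":
    candidates = [
        "/Documents/report.doc",
        "/Documents/report.docx",
        "/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf",
    ]
    for path in candidates:
        print(path, "->", "include" if is_included(path) else "skip")
```

With rules like these the scanned PDF is never fetched, which fits the observation that re-indexing works repeatedly once only *.doc and *.docx files are included.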