Hi Ian,

The TikaException means that Tika is failing to parse some of your documents. You can configure Solr to ignore Tika exceptions so that a single unparseable document doesn't fail the whole request. The documentation should cover this; there is also a rough solrconfig.xml sketch at the bottom of this mail, below the quoted thread:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Karl

On Tue, Mar 24, 2015 at 11:50 AM, Ian Zapczynski <[email protected]> wrote:

> I mostly get these repeated many, many times over:
>
> ERROR - 2015-03-24 14:48:40.321; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@46a9acab
> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> <snip>
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@46a9acab
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> <snip>
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
> at java.lang.String.charAt(String.java:646)
>
> Then of course I get some of these, which are expected when we have encrypted or password-protected files:
>
> ERROR - 2015-03-24 14:48:40.962; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@38902a7
> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> <snip>
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@38902a7
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> <snip>
> Caused by: org.apache.poi.EncryptedDocumentException: Cannot process encrypted word file
>
> >>> Karl Wright <[email protected]> 3/24/2015 11:22 AM >>>
> "failure processing document Server at http://localhost:8983/solr returned non ok status:500, message:Server Error"
>
> That's an error occurring on Solr. What do the Solr logs say?
>
> Karl
>
> On Tue, Mar 24, 2015 at 11:11 AM, Ian Zapczynski <[email protected]> wrote:
>
>> Unfortunately I'm still getting stuck indexing. It appears that I have such a large number of password-protected docs and scanned PDFs without OCR that the job is dying before it even finds all the "good" docs. It dies with the error "Error: Repeated service interruptions - failure processing document Server at http://localhost:8983/solr returned non ok status:500, message:Server Error". My job tells me there are 168,595 documents, with 73,238 currently active and 106,291 processed. At this point, if I keep restarting the job, it slowly adds a small number of new docs, but then dies again with the same error. The thing is, it has not indexed a large number of documents that should be indexable.
>> To help clarify, I am indexing a folder that contains thousands of subfolders named after companies we are associated with, each with several files and folders inside. If I test searches in SOLR by reviewing a document and then searching on text from it, I consistently get the expected results for files inside folders whose company names begin with A, B, C, D, etc. However, I do not get results for files inside folders whose names begin with R, S, T, U, V, etc.
>> We are looking into whether we should batch-convert the scanned PDFs for OCR and thereby cut down on the number of problem docs, but for now I'd just like to get all of the indexable documents into SOLR.
>> Going back to my original question, should I consider breaking this single job into multiple jobs based on the first letter of the alphabet? If so, I haven't been able to figure out a working expression to pick up all files and folders under a folder whose name begins with R-Z, for example. And if not that workaround, where do you suggest I look to resolve this? I'm not entirely sure why all of the files and folders are not being traversed before my job dies (is this a ManifoldCF thing or a SOLR thing?)
>> Thanks again for your help.
>>
>> >>> Ian Zapczynski 3/20/2015 2:25 PM >>>
>> Thanks for the help, Karl. Yup, I was using the simple-to-set-up single-process configuration, and silly me... after I restarted from scratch at one point, I completely failed to update the combined-options-env.win config file that you referred to, so MCF was still set to use only 256 MB despite my thinking otherwise. I've bumped it up to 4 GB, and the job recovered and is finally moving along again.
>> -Ian
>>
>> >>> Karl Wright <[email protected]> 3/20/2015 10:55 AM >>>
>> Hi Ian,
>>
>> HSQLDB is an interesting database in that it is *not* memory constrained: it attempts to keep everything in memory.
>>
>> I'd strongly suggest giving the MCF agents process a lot more memory, say 2G, if you want to keep using HSQLDB. A better choice would be PostgreSQL or MySQL. There's a configuration file where you can put java switches for all of the processes; start by doing that.
>>
>> Thanks,
>> Karl
>>
>> On Fri, Mar 20, 2015 at 9:29 AM, Ian Zapczynski <[email protected]> wrote:
>>
>>> Hi Karl,
>>> I have SOLR and ManifoldCF running with Tomcat on a Windows 2012 R2 server. Linux would have been my preference, but various logistics prevented me from using it. I have set the maximum document length to 3072000. I chose a larger size than might be normal because when I first did a test, I could see that a lot of docs were getting rejected based on size, and it seems folks around here don't reduce the size of their PDFs.
>>> The errors from the log are below. I had been paying more attention to the errors spit out to the console, which didn't so obviously point to the backend database being the culprit. I'm guessing that I'm pushing the database too hard and should really be using PostgreSQL, right? I don't know why, but I didn't see the deployment documentation that covers using other databases until now; I was working off of the ManifoldCF End User Documentation as well as a (mostly) helpful blog post I found elsewhere.
>>> Much thanks,
>>> -Ian
>>>
>>> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file1.pdf (500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>>> org.apache.solr.common.SolrException: Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>>> at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:303)
>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>>> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>>> at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:894)
>>> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Service interruption reported for job 1426796577848 connection 'MACLSTR file server': Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file3.pdf (500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>>> ERROR 2015-03-19 18:31:45,730 (Job delete thread) - Job delete thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> ERROR 2015-03-19 18:31:45,309 (Finisher thread) - Finisher thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> ERROR 2015-03-19 18:31:43,043 (Set priority thread) - Set priority thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> ERROR 2015-03-19 18:32:02,292 (Job notification thread) - Job notification thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> FATAL 2015-03-19 18:32:05,870 (Thread-3838608) - C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Found a long-running query (64919 ms): [SELECT id,status,connectionname FROM jobs WHERE assessmentstate=? FOR UPDATE]
>>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Parameter 0: 'N'
>>> ERROR 2015-03-19 18:32:09,167 (Job reset thread) - Job reset thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
>>> at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>>> at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
>>> at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>>> at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>>> at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>>> at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:750)
>>> at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performUpdate(DBInterfaceHSQLDB.java:296)
>>> at org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
>>> at org.apache.manifoldcf.crawler.jobs.JobQueue.noDocPriorities(JobQueue.java:967)
>>> at org.apache.manifoldcf.crawler.jobs.JobManager.noDocPriorities(JobManager.java:8148)
>>> at org.apache.manifoldcf.crawler.jobs.JobManager.finishJobStops(JobManager.java:8123)
>>> at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:69)
>>> Caused by: java.sql.SQLException: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>>> at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>>> at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
>>> at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)
>>> at org.apache.manifoldcf.core.database.Database.execute(Database.java:903)
>>> at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683)
>>> Caused by: org.hsqldb.HsqlException: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> at org.hsqldb.error.Error.error(Unknown Source)
>>> at org.hsqldb.result.Result.newErrorResult(Unknown Source)
>>> at org.hsqldb.StatementDMQL.execute(Unknown Source)
>>> at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
>>> at org.hsqldb.Session.execute(Unknown Source)
>>> ... 4 more
>>> Caused by: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> at org.hsqldb.lib.FrameworkLogger.privlog(Unknown Source)
>>> at org.hsqldb.lib.FrameworkLogger.severe(Unknown Source)
>>> at org.hsqldb.persist.Logger.logSevereEvent(Unknown Source)
>>> at org.hsqldb.persist.DataFileCache.logSevereEvent(Unknown Source)
>>> at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source)
>>> at org.hsqldb.persist.DataFileCache.get(Unknown Source)
>>> at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
>>> at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source)
>>> at org.hsqldb.index.NodeAVLDisk.getRight(Unknown Source)
>>> at org.hsqldb.index.IndexAVL.next(Unknown Source)
>>> at org.hsqldb.index.IndexAVL.next(Unknown Source)
>>> at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source)
>>> at org.hsqldb.RangeVariable$RangeIteratorMain.findNext(Unknown Source)
>>> at org.hsqldb.RangeVariable$RangeIteratorMain.next(Unknown Source)
>>> at org.hsqldb.StatementDML.executeUpdateStatement(Unknown Source)
>>> at org.hsqldb.StatementDML.getResult(Unknown Source)
>>> ... 7 more
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:483)
>>> ... 23 more
>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Plan: isDistinctSelect=[false]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isGrouped=[false]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isAggregated=[false]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: columns=[ COLUMN: PUBLIC.JOBS.ID not nullable
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.STATUS not nullable
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.CONNECTIONNAME not nullable
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan:
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: [range variable 1
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join type=INNER
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: table=JOBS
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: cardinality=5
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: access=FULL SCAN
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join condition = [index=SYS_IDX_SYS_PK_10234_10237
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: other condition=[
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: EQUAL arg_left=[ COLUMN: PUBLIC.JOBS.ASSESSMENTSTATE
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ] arg_right=[ DYNAMIC PARAM: , TYPE = CHARACTER
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: PARAMETERS=[
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: @0[DYNAMIC PARAM: , TYPE = CHARACTER
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: SUBQUERIES[]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) -
>>> FATAL 2015-03-19 18:32:09,198 (Job notification thread) - JobNotificationThread initialization error tossed: GC overhead limit exceeded
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> FATAL 2015-03-19 18:32:09,198 (Set priority thread) - SetPriorityThread initialization error tossed: GC overhead limit exceeded
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> FATAL 2015-03-19 18:32:09,198 (Finisher thread) - FinisherThread initialization error tossed: GC overhead limit exceeded
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> FATAL 2015-03-19 18:32:09,198 (Job delete thread) - JobDeleteThread initialization error tossed: GC overhead limit exceeded
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> FATAL 2015-03-19 18:32:09,198 (Seeding thread) - SeedingThread initialization error tossed: GC overhead limit exceeded
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> >>> Karl Wright <[email protected]> 3/19/2015 3:34 PM >>>
>>> Hi Ian,
>>>
>>> ManifoldCF operates under what is known as a "bounded" memory model. That means that you should always be able to find a memory size that works (that isn't huge).
>>>
>>> The only exception to this is for Solr indexing that does *not* go via the extracting update handler. The standard update handler unfortunately *requires* that the entire document fit in memory. If this is what you are doing, you must take steps to limit the maximum document size to prevent OOM's.
>>>
>>> 160,000 documents is quite small by MCF standards (we do 10 million to 50 million on some setups). So let's diagnose your problem before taking any bizarre actions. Can you provide an out-of-memory dump from the log, for instance? Can you let us know what deployment model you are using (e.g. single-process, etc.)?
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Thu, Mar 19, 2015 at 3:07 PM, Ian Zapczynski <[email protected]> wrote:
>>>
>>>> Hello all. I am using ManifoldCF to index a Windows share containing well over 160,000 files (.xls, .pdf, .doc). I keep getting memory errors when I try to index the whole folder at once and have not been able to resolve this by throwing memory and CPU at Tomcat and the VM, so I thought I'd try this a different way.
>>>> What I'd like to do now is break what was a single job up into multiple jobs. Each job should index all indexable files under a parent folder, with one job indexing folders whose names begin with the letters A-G as well as all subfolders and files within, another job for H-M also with all subfolders/files, and so on. My problem is, somehow I can't manage to figure out what expression to use to get it to index what I want.
>>>> In the Job settings under Paths, I have specified the parent folder, and within there I've tried:
>>>> 1. Include file(s) or directory(s) matching * (this works, but indexes every file in every folder within the parent, eventually causing me unresolvable GC memory overhead errors)
>>>> 2. Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not work; it supposedly indexes one file and then quits)
>>>> 3. Include file(s) or directory(s) matching A* (this does not work; it supposedly indexes one file and then quits, and there are many folders directly under the parent that begin with 'A')
>>>> Can anyone help confirm what type of expression I should use in the paths to accomplish what I want?
>>>> Or alternately, if you think I should be able to index 160,000+ files in one job without getting GC memory overhead errors, I'm open to hear your suggestions on resolving those. All I know to do is increase the maximum memory in Tomcat as well as on the OS, and that didn't help at all.
>>>> Thanks much!
>>>> -Ian
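For the Tika failures specifically: if your Solr version supports it, the extracting handler accepts an ignoreTikaException parameter that tells it to index whatever it can rather than returning a 500 for the whole request. The snippet below is only a rough sketch, not your actual config: merge the one parameter into the /update/extract handler already in your solrconfig.xml, and treat the other defaults shown (lowernames, uprefix) as placeholders for whatever you already have there.

  <requestHandler name="/update/extract"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <!-- Index what Tika can extract instead of failing the request
           when the parser throws (e.g. the encrypted Word files). -->
      <str name="ignoreTikaException">true</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
    </lst>
  </requestHandler>

With that in place, documents Tika cannot parse should still be indexed with whatever metadata is available instead of killing the batch.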
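On the earlier out-of-memory errors in the quoted thread: with the single-process example, the agents process gets its heap from the JVM switches in the options file you mentioned (combined-options-env.win in your setup). A minimal sketch, assuming a 64-bit JVM with physical RAM to spare, and keeping whatever other switches are already in that file:

  -Xms1024m
  -Xmx4096m

That only buys headroom for HSQLDB, which keeps its working set in memory; for a crawl of this size, moving the back-end database to PostgreSQL (covered in the deployment documentation you referred to) is the more durable fix.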
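And on your earlier question about splitting the crawl into A-G / H-M / R-Z jobs: I won't guess here at exactly what the Windows share connector matches the "Include file(s) or directory(s) matching" expression against (bare folder name versus full path), so it is worth sanity-checking a candidate pattern against real folder names before wiring it into a job. A throwaway harness like the following (the sample names are made up) at least confirms the character-class syntax does what you intend:

  import java.util.Arrays;
  import java.util.List;
  import java.util.regex.Pattern;

  public class FolderPatternCheck {
      public static void main(String[] args) {
          // Candidate: names whose first letter is R through Z, case-insensitive.
          Pattern rToZ = Pattern.compile("^[R-Z].*", Pattern.CASE_INSENSITIVE);

          // Hypothetical folder names; substitute real ones from the share.
          List<String> samples = Arrays.asList("Acme Corp", "Redwood Partners", "zeta holdings");

          for (String name : samples) {
              System.out.printf("%-20s -> %b%n", name, rToZ.matcher(name).matches());
          }
      }
  }

If the connector turns out to match full paths rather than names, the same harness makes it easy to adjust the pattern and re-test before re-running the job.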
