" failure processing document Server athttp://localhost:8983/solr returned non ok status:500, message:Server Error""
That's an error occurring on Solr. What do the Solr logs say?

Karl

On Tue, Mar 24, 2015 at 11:11 AM, Ian Zapczynski <[email protected]> wrote:

> Unfortunately I'm still getting stuck indexing. It at least appears to me that I have such a large number of password-protected docs and scanned PDFs without OCR that the job is dying on me before it even finds all the "good" docs. It will die with the error "Error: Repeated service interruptions - failure processing document Server at http://localhost:8983/solr returned non ok status:500, message:Server Error". My job tells me there are 168,595 documents, with 73,238 currently active and 106,291 processed. At this point, if I keep restarting the job, it slowly adds a small number of new docs, but then dies again with the same error. The thing is, it has not indexed a large number of documents that should be indexable.
>
> To help clarify, I am indexing a folder that contains thousands of folders named after companies we have associated with, each holding several files and folders of its own. If I test searches in SOLR by reviewing a document and then performing a search based on that text, I consistently get the expected results for files within folders whose company names begin with A, B, C, D, etc. However, I do not get results for files within folders whose company names begin with R, S, T, U, V, etc.
>
> We are looking into whether we should batch-convert the scanned PDFs to support OCR and thereby cut down on the number of problem docs, but for now I'd like to just get all of the indexable documents into SOLR.
>
> Going back to my original question, should I consider breaking this single job into multiple jobs based on the letter of the alphabet? If so, I haven't been able to figure out a working regex to tell it to pick up all files and folders within a folder whose name begins with R-Z, for example. And if not that workaround, where do you suggest I go to resolve this? I'm not entirely sure what is causing all of the files and folders not to be traversed before my job dies (is this a ManifoldCF thing or a SOLR thing?).
>
> Thanks again for your help.
>
> >>> Ian Zapczynski 3/20/2015 2:25 PM >>>
> Thanks for the help, Karl. Yup, I was using the simple-to-set-up single-process configuration, and silly me... after I restarted from scratch at one point, I completely failed to update the combined-options-env.win config file that you referred to, so MCF was still set to use only 256 MB despite my thinking otherwise. I've bumped it up to 4 GB, and the job recovered and is finally moving along again.
> -Ian
>
> >>> Karl Wright <[email protected]> 3/20/2015 10:55 AM >>>
> Hi Ian,
>
> HSQLDB is an interesting database in that it is *not* memory constrained: it attempts to keep everything in memory.
>
> I'd strongly suggest giving the MCF agents process a lot more memory, say 2 GB, if you want to keep using HSQLDB. A better choice would be PostgreSQL or MySQL. There's a configuration file where you can put Java switches for all of the processes; start by doing that.
>
> Thanks,
> Karl
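The combined-options-env.win file referred to above is the configuration file Karl means for Java switches. If it follows the usual one-switch-per-line layout of the single-process example's options files (the exact contents vary by ManifoldCF version, so treat these values as illustrative), the heap bump Ian describes amounts to edits along these lines:

    -Xms1024m
    -Xmx4096m

Because HSQLDB tries to hold its entire working set on the heap, that -Xmx value is effectively the database's memory budget as well as the crawler's, which is why PostgreSQL or MySQL is the recommendation for larger crawls.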
> On Fri, Mar 20, 2015 at 9:29 AM, Ian Zapczynski <[email protected]> wrote:
>
>> Hi Karl,
>>
>> I have SOLR and ManifoldCF running with Tomcat on a Windows 2012 R2 server. Linux would have been my preference, but various logistics prevented me from using that.
>>
>> I have set the maximum document length to be 3072000. I chose a larger size than what might be normal because when I first did a test, I could see that a lot of docs were getting rejected based on size, and it seems folks around here don't reduce/shrink the size of their PDFs.
>>
>> The errors from the log are below. I was more busy paying attention to the errors spit out to the console, which didn't so obviously point to the backend database being the culprit. I'm guessing that I'm pushing the database too hard and should really be using PostgreSQL, right? I don't know why, but I didn't see or reference the deployment documentation that covered using various other databases until now. I was working off of the ManifoldCF End User Documentation as well as a (mostly) helpful blog post I found elsewhere.
>>
>> Much thanks,
>> -Ian
>>
>> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file1.pdf (500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>> org.apache.solr.common.SolrException: Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>>   at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:303)
>>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>>   at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>>   at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:894)
>> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Service interruption reported for job 1426796577848 connection 'MACLSTR file server': Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file3.pdf (500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>> ERROR 2015-03-19 18:31:45,730 (Job delete thread) - Job delete thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>> ERROR 2015-03-19 18:31:45,309 (Finisher thread) - Finisher thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>> ERROR 2015-03-19 18:31:43,043 (Set priority thread) - Set priority thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>> ERROR 2015-03-19 18:32:02,292 (Job notification thread) - Job notification thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>> FATAL 2015-03-19 18:32:05,870 (Thread-3838608) - C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Found a long-running query (64919 ms): [SELECT id,status,connectionname FROM jobs WHERE assessmentstate=? FOR UPDATE]
>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Parameter 0: 'N'
>> ERROR 2015-03-19 18:32:09,167 (Job reset thread) - Job reset thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>   at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
>>   at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>>   at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
>>   at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>>   at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>>   at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>>   at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:750)
>>   at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performUpdate(DBInterfaceHSQLDB.java:296)
>>   at org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
>>   at org.apache.manifoldcf.crawler.jobs.JobQueue.noDocPriorities(JobQueue.java:967)
>>   at org.apache.manifoldcf.crawler.jobs.JobManager.noDocPriorities(JobManager.java:8148)
>>   at org.apache.manifoldcf.crawler.jobs.JobManager.finishJobStops(JobManager.java:8123)
>>   at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:69)
>> Caused by: java.sql.SQLException: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>   at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>>   at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>>   at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
>>   at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)
>>   at org.apache.manifoldcf.core.database.Database.execute(Database.java:903)
>>   at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683)
>> Caused by: org.hsqldb.HsqlException: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>   at org.hsqldb.error.Error.error(Unknown Source)
>>   at org.hsqldb.result.Result.newErrorResult(Unknown Source)
>>   at org.hsqldb.StatementDMQL.execute(Unknown Source)
>>   at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
>>   at org.hsqldb.Session.execute(Unknown Source)
>>   ... 4 more
>> Caused by: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>   at org.hsqldb.lib.FrameworkLogger.privlog(Unknown Source)
>>   at org.hsqldb.lib.FrameworkLogger.severe(Unknown Source)
>>   at org.hsqldb.persist.Logger.logSevereEvent(Unknown Source)
>>   at org.hsqldb.persist.DataFileCache.logSevereEvent(Unknown Source)
>>   at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source)
>>   at org.hsqldb.persist.DataFileCache.get(Unknown Source)
>>   at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
>>   at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source)
>>   at org.hsqldb.index.NodeAVLDisk.getRight(Unknown Source)
>>   at org.hsqldb.index.IndexAVL.next(Unknown Source)
>>   at org.hsqldb.index.IndexAVL.next(Unknown Source)
>>   at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source)
>>   at org.hsqldb.RangeVariable$RangeIteratorMain.findNext(Unknown Source)
>>   at org.hsqldb.RangeVariable$RangeIteratorMain.next(Unknown Source)
>>   at org.hsqldb.StatementDML.executeUpdateStatement(Unknown Source)
>>   at org.hsqldb.StatementDML.getResult(Unknown Source)
>>   ... 7 more
>> Caused by: java.lang.reflect.InvocationTargetException
>>   at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:483)
>>   ... 23 more
>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Plan: isDistinctSelect=[false]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isGrouped=[false]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isAggregated=[false]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: columns=[ COLUMN: PUBLIC.JOBS.ID not nullable
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.STATUS not nullable
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.CONNECTIONNAME not nullable
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan:
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: [range variable 1
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join type=INNER
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: table=JOBS
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: cardinality=5
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: access=FULL SCAN
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join condition = [index=SYS_IDX_SYS_PK_10234_10237
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: other condition=[
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: EQUAL arg_left=[ COLUMN: PUBLIC.JOBS.ASSESSMENTSTATE
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ] arg_right=[ DYNAMIC PARAM: , TYPE = CHARACTER
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: PARAMETERS=[
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: @0[DYNAMIC PARAM: , TYPE = CHARACTER
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: SUBQUERIES[]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) -
>> FATAL 2015-03-19 18:32:09,198 (Job notification thread) - JobNotificationThread initialization error tossed: GC overhead limit exceeded
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> FATAL 2015-03-19 18:32:09,198 (Set priority thread) - SetPriorityThread initialization error tossed: GC overhead limit exceeded
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> FATAL 2015-03-19 18:32:09,198 (Finisher thread) - FinisherThread initialization error tossed: GC overhead limit exceeded
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> FATAL 2015-03-19 18:32:09,198 (Job delete thread) - JobDeleteThread initialization error tossed: GC overhead limit exceeded
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> FATAL 2015-03-19 18:32:09,198 (Seeding thread) - SeedingThread initialization error tossed: GC overhead limit exceeded
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> >>> Karl Wright <[email protected]> 3/19/2015 3:34 PM >>>
>> Hi Ian,
>>
>> ManifoldCF operates under what is known as a "bounded" memory model. That means that you should always be able to find a memory size that works (one that isn't huge).
>>
>> The only exception to this is Solr indexing that does *not* go via the extracting update handler. The standard update handler unfortunately *requires* that the entire document fit in memory. If this is what you are doing, you must take steps to limit the maximum document size to prevent OOMs.
>>
>> 160,000 documents is quite small by MCF standards (we do 10 million to 50 million on some setups). So let's diagnose your problem before taking any bizarre actions. Can you provide an out-of-memory dump from the log, for instance? Can you let us know what deployment model you are using (e.g. single-process, etc.)?
>>
>> Thanks,
>> Karl
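On the extracting update handler Karl mentions: that is Solr's /update/extract endpoint (Solr Cell), where Tika runs inside Solr and the raw document is streamed rather than built up in the crawler's heap. To check outside of ManifoldCF whether Solr copes with one of the problem PDFs, a rough SolrJ 4.x-style sketch like the one below can be used. It assumes the ExtractingRequestHandler is registered at /update/extract in solrconfig.xml and that the schema has an "id" field and an "attr_*" dynamic field; the file path and literal.id value are placeholders.

    import java.io.File;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractOneFile {
        public static void main(String[] args) throws Exception {
            // Base URL as it appears in the MCF log messages; append the core
            // name (e.g. /collection1) if your install requires it.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

            // Stream one suspect PDF to Solr Cell; Tika extraction happens on the Solr side.
            // The local path and the literal.id value below are placeholders.
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("C:\\temp\\suspect.pdf"), "application/pdf");
            req.setParam("literal.id", "test-suspect-pdf");
            // Send unknown Tika metadata fields somewhere harmless; adjust to your schema.
            req.setParam("uprefix", "attr_");
            req.setParam("fmap.content", "attr_content");
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

            System.out.println(server.request(req));
            server.shutdown();
        }
    }

If this request also fails with a 500, the Solr log (which is what Karl asked about) should show the underlying extraction exception for that particular file; password-protected PDFs commonly fail at this step.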
>> On Thu, Mar 19, 2015 at 3:07 PM, Ian Zapczynski <[email protected]> wrote:
>>
>>> Hello all. I am using ManifoldCF to index a Windows share containing well over 160,000 files (.xls, .pdf, .doc). I keep getting memory errors when I try to index the whole folder at once and have not been able to resolve this by throwing memory and CPU at Tomcat and the VM, so I thought I'd try this a different way.
>>>
>>> What I'd like to do now is break what was a single job up into multiple jobs. Each job should index all indexable files under a parent folder, with one job indexing folders whose names begin with the letters A-G as well as all subfolders and files within, another job for H-M also with all subfolders/files, and so on. My problem is, somehow I can't manage to figure out what expression to use to get it to index what I want.
>>>
>>> In the Job settings under Paths, I have specified the parent folder, and within there I've tried:
>>> 1. Include file(s) or directory(s) matching * (this works, but indexes every file in every folder within the parent, eventually causing me unresolvable GC memory overhead errors)
>>> 2. Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not work; it supposedly indexes one file and then quits)
>>> 3. Include file(s) or directory(s) matching A* (this does not work; it supposedly indexes one file and then quits, and there are many folders directly under the parent that begin with 'A')
>>>
>>> Can anyone help confirm what type of expression I should use in the paths to accomplish what I want? Or alternately, if you think I should be able to index 160,000+ files in one job without getting GC memory overhead errors, I'm open to hearing your suggestions on resolving those. All I know to do is increase the maximum memory in Tomcat as well as on the OS, and that didn't help at all.
>>>
>>> Thanks much!
>>> -Ian
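On the A-G / H-M style split: whatever syntax the connector's "Include file(s) or directory(s) matching" field actually expects (the connector documentation is the authority here; if it treats the value as a simple * / ? wildcard rather than a Java regular expression, that would explain why ^(?i)[A-G]* misbehaved), it is worth first checking which top-level folders each candidate range would cover. A small standalone sketch follows, using the share root that appears in the log messages above; the letter ranges are just examples.

    import java.io.File;
    import java.util.regex.Pattern;

    public class FolderRangeCheck {
        public static void main(String[] args) {
            // Share root as it appears in the MCF log messages; adjust as needed.
            File root = new File("\\\\host.domain.com\\FileShare1\\Data\\Manager Information");

            // Candidate per-job ranges, matched case-insensitively on the first character.
            Pattern[] ranges = {
                Pattern.compile("(?i)^[A-G].*"),
                Pattern.compile("(?i)^[H-M].*"),
                Pattern.compile("(?i)^[N-Z].*"),
            };
            int[] counts = new int[ranges.length];
            int unmatched = 0;  // folders starting with digits, punctuation, etc.

            File[] children = root.listFiles();
            if (children == null) {
                System.err.println("Cannot list " + root);
                return;
            }
            for (File child : children) {
                if (!child.isDirectory()) {
                    continue;
                }
                boolean hit = false;
                for (int i = 0; i < ranges.length; i++) {
                    if (ranges[i].matcher(child.getName()).matches()) {
                        counts[i]++;
                        hit = true;
                        break;
                    }
                }
                if (!hit) {
                    unmatched++;
                }
            }
            for (int i = 0; i < ranges.length; i++) {
                System.out.println(ranges[i].pattern() + " -> " + counts[i] + " folders");
            }
            System.out.println("unmatched -> " + unmatched + " folders");
        }
    }

Folders whose names start with digits or punctuation land in the "unmatched" bucket and are easy to lose when splitting by letter, so they would need a catch-all job of their own.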

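Related to the password-protected and scanned-PDF problem mentioned near the top of the thread: before deciding whether to batch-OCR, it can help to count how many PDFs on the share are encrypted or carry no extractable text at all. A rough triage sketch, assuming Apache PDFBox (1.8-style API) on the classpath; this is only an approximation and not how ManifoldCF or Solr Cell themselves classify documents.

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PdfTriage {
        public static void main(String[] args) {
            // Walk the share (or a local copy); the path is a placeholder.
            int[] totals = walk(new File("\\\\host.domain.com\\FileShare1\\Data\\Manager Information"), new int[4]);
            System.out.printf("encrypted=%d, noText=%d, hasText=%d, unreadable=%d%n",
                    totals[0], totals[1], totals[2], totals[3]);
        }

        private static int[] walk(File dir, int[] totals) {
            File[] children = dir.listFiles();
            if (children == null) return totals;
            for (File f : children) {
                if (f.isDirectory()) {
                    walk(f, totals);
                } else if (f.getName().toLowerCase().endsWith(".pdf")) {
                    classify(f, totals);
                }
            }
            return totals;
        }

        private static void classify(File f, int[] totals) {
            PDDocument doc = null;
            try {
                doc = PDDocument.load(f);
                if (doc.isEncrypted()) {
                    totals[0]++;   // password-protected
                } else if (new PDFTextStripper().getText(doc).trim().isEmpty()) {
                    totals[1]++;   // likely a scan with no OCR text layer
                } else {
                    totals[2]++;   // has extractable text
                }
            } catch (IOException e) {
                totals[3]++;       // corrupt or otherwise unreadable
            } finally {
                if (doc != null) {
                    try { doc.close(); } catch (IOException ignored) {}
                }
            }
        }
    }

If most of the failing documents turn out to be encrypted or text-free scans, excluding or converting them should cut the error volume considerably, independently of the memory and database questions.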