Hi Ian,

HSQLDB is an interesting database in that it is *not* memory constrained: it attempts to keep everything in memory.

If you want to keep using HSQLDB, I'd strongly suggest giving the MCF agents process a lot more memory, say 2G. A better choice, though, would be PostgreSQL or MySQL. There's a configuration file where you can put Java switches for all of the processes; start by doing that.
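In the single-process example deployment those switches usually live in a small options file next to start.jar, in recent releases named something like options.env.win (options.env.unix on Linux); the exact file name and location vary by MCF version, so check your example directory. One switch per line, for instance:

    -Xms1024m
    -Xmx2048m

Switching databases is done in properties.xml. A minimal sketch for PostgreSQL, assuming a local server; the property names follow the MCF deployment documentation, while the hostname and credentials below are placeholders to adjust for your installation:

    <!-- properties.xml excerpt: select the PostgreSQL implementation -->
    <property name="org.apache.manifoldcf.databaseimplementationclass"
              value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
    <!-- hostname and credentials are placeholders for illustration -->
    <property name="org.apache.manifoldcf.postgresql.hostname" value="localhost"/>
    <property name="org.apache.manifoldcf.dbsuperusername" value="postgres"/>
    <property name="org.apache.manifoldcf.dbsuperuserpassword" value="yourpassword"/>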
Thanks,
Karl

On Fri, Mar 20, 2015 at 9:29 AM, Ian Zapczynski <[email protected]> wrote:

> Hi Karl,
>
> I have Solr and ManifoldCF running with Tomcat on a Windows 2012 R2 server. Linux would have been my preference, but various logistics prevented me from using it. I have set the maximum document length to 3072000. I chose a larger size than might be normal because when I first ran a test, I could see that a lot of docs were getting rejected based on size, and it seems folks around here don't shrink their PDFs.
>
> The errors from the log are below. I had mostly been paying attention to the errors spat out to the console, which didn't so obviously point to the backend database as the culprit. I'm guessing that I'm pushing the database too hard and should really be using PostgreSQL, right? I don't know why, but I didn't come across the deployment documentation covering the other supported databases until now; I was working from the ManifoldCF End User Documentation as well as a (mostly) helpful blog post I found elsewhere.
>
> Much thanks,
>
> -Ian
>
> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file1.pdf (500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
> org.apache.solr.common.SolrException: Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>   at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:303)
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>   at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>   at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:894)
> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Service interruption reported for job 1426796577848 connection 'MACLSTR file server': Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file3.pdf (500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
> ERROR 2015-03-19 18:31:45,730 (Job delete thread) - Job delete thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
> ERROR 2015-03-19 18:31:45,309 (Finisher thread) - Finisher thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
> ERROR 2015-03-19 18:31:43,043 (Set priority thread) - Set priority thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
> ERROR 2015-03-19 18:32:02,292 (Job notification thread) - Job notification thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
> FATAL 2015-03-19 18:32:05,870 (Thread-3838608) - C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Found a long-running query (64919 ms): [SELECT id,status,connectionname FROM jobs WHERE assessmentstate=? FOR UPDATE]
> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Parameter 0: 'N'
> ERROR 2015-03-19 18:32:09,167 (Job reset thread) - Job reset thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>   at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
>   at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>   at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
>   at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>   at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>   at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>   at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:750)
>   at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performUpdate(DBInterfaceHSQLDB.java:296)
>   at org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
>   at org.apache.manifoldcf.crawler.jobs.JobQueue.noDocPriorities(JobQueue.java:967)
>   at org.apache.manifoldcf.crawler.jobs.JobManager.noDocPriorities(JobManager.java:8148)
>   at org.apache.manifoldcf.crawler.jobs.JobManager.finishJobStops(JobManager.java:8123)
>   at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:69)
> Caused by: java.sql.SQLException: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>   at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>   at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>   at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
>   at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)
>   at org.apache.manifoldcf.core.database.Database.execute(Database.java:903)
>   at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683)
> Caused by: org.hsqldb.HsqlException: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>   at org.hsqldb.error.Error.error(Unknown Source)
>   at org.hsqldb.result.Result.newErrorResult(Unknown Source)
>   at org.hsqldb.StatementDMQL.execute(Unknown Source)
>   at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
>   at org.hsqldb.Session.execute(Unknown Source)
>   ... 4 more
> Caused by: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>   at org.hsqldb.lib.FrameworkLogger.privlog(Unknown Source)
>   at org.hsqldb.lib.FrameworkLogger.severe(Unknown Source)
>   at org.hsqldb.persist.Logger.logSevereEvent(Unknown Source)
>   at org.hsqldb.persist.DataFileCache.logSevereEvent(Unknown Source)
>   at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source)
>   at org.hsqldb.persist.DataFileCache.get(Unknown Source)
>   at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
>   at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source)
>   at org.hsqldb.index.NodeAVLDisk.getRight(Unknown Source)
>   at org.hsqldb.index.IndexAVL.next(Unknown Source)
>   at org.hsqldb.index.IndexAVL.next(Unknown Source)
>   at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source)
>   at org.hsqldb.RangeVariable$RangeIteratorMain.findNext(Unknown Source)
>   at org.hsqldb.RangeVariable$RangeIteratorMain.next(Unknown Source)
>   at org.hsqldb.StatementDML.executeUpdateStatement(Unknown Source)
>   at org.hsqldb.StatementDML.getResult(Unknown Source)
>   ... 7 more
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   ... 23 more
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Plan: isDistinctSelect=[false]
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isGrouped=[false]
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isAggregated=[false]
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: columns=[ COLUMN: PUBLIC.JOBS.ID not nullable
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.STATUS not nullable
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.CONNECTIONNAME not nullable
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan:
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: [range variable 1
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join type=INNER
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: table=JOBS
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: cardinality=5
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: access=FULL SCAN
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join condition = [index=SYS_IDX_SYS_PK_10234_10237
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: other condition=[
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: EQUAL arg_left=[ COLUMN: PUBLIC.JOBS.ASSESSMENTSTATE
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ] arg_right=[ DYNAMIC PARAM: , TYPE = CHARACTER
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: PARAMETERS=[
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: @0[DYNAMIC PARAM: , TYPE = CHARACTER
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: SUBQUERIES[]
> WARN 2015-03-19 18:32:09,182 (Assessment thread) -
> FATAL 2015-03-19 18:32:09,198 (Job notification thread) - JobNotificationThread initialization error tossed: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> FATAL 2015-03-19 18:32:09,198 (Set priority thread) - SetPriorityThread initialization error tossed: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> FATAL 2015-03-19 18:32:09,198 (Finisher thread) - FinisherThread initialization error tossed: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> FATAL 2015-03-19 18:32:09,198 (Job delete thread) - JobDeleteThread initialization error tossed: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> FATAL 2015-03-19 18:32:09,198 (Seeding thread) - SeedingThread initialization error tossed: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> >>> Karl Wright <[email protected]> 3/19/2015 3:34 PM >>>
> Hi Ian,
>
> ManifoldCF operates under what is known as a "bounded" memory model. That means you should always be able to find a memory size that works (and one that isn't huge).
>
> The only exception to this is Solr indexing that does *not* go via the extracting update handler. The standard update handler unfortunately *requires* that the entire document fit in memory. If this is what you are doing, you must take steps to limit the maximum document size to prevent OOMs.
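For reference, the extracting update handler mentioned above is Solr's /update/extract endpoint (Solr Cell), which does the text extraction on the Solr side so a whole document never has to be held in the crawler's memory. It is registered in solrconfig.xml roughly as in the stock Solr 4.x example config; this sketch assumes the extraction contrib jars (Tika and its dependencies) are on Solr's classpath:

    <!-- solrconfig.xml: Solr Cell endpoint; needs the extraction contrib -->
    <requestHandler name="/update/extract" startup="lazy"
                    class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>
      </lst>
    </requestHandler>

The MCF Solr output connection can then be pointed at /update/extract via its update handler setting (the exact field label varies by MCF version) instead of the standard /update handler.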
> 160,000 documents is quite small by MCF standards (we do 10 million to 50 million on some setups), so let's diagnose your problem before taking any drastic action. Can you provide an out-of-memory dump from the log, for instance? Can you let us know what deployment model you are using (e.g. single-process)?
>
> Thanks,
> Karl
>
> On Thu, Mar 19, 2015 at 3:07 PM, Ian Zapczynski <[email protected]> wrote:
>
>> Hello all. I am using ManifoldCF to index a Windows share containing well over 160,000 files (.xls, .pdf, .doc). I keep getting memory errors when I try to index the whole folder at once, and I have not been able to resolve this by throwing memory and CPU at Tomcat and the VM, so I thought I'd try a different approach.
>> What I'd like to do now is break what was a single job up into multiple jobs. Each job should index all indexable files under a parent folder: one job indexing folders whose names begin with the letters A-G (along with all their subfolders and files), another job for H-M with all its subfolders/files, and so on. My problem is that I can't figure out what expression to use to make it index what I want.
>> In the job settings under Paths, I have specified the parent folder, and within it I've tried:
>> 1. Include file(s) or directory(s) matching * (this works, but indexes every file in every folder within the parent, eventually causing the unresolvable GC overhead errors)
>> 2. Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not work; it supposedly indexes one file and then quits)
>> 3. Include file(s) or directory(s) matching A* (this does not work; it supposedly indexes one file and then quits, even though there are many folders directly under the parent that begin with 'A')
>> Can anyone confirm what type of expression I should use in the paths to accomplish this?
>> Or, alternately, if you think I should be able to index 160,000+ files in one job without getting GC overhead errors, I'm open to hearing your suggestions on resolving those. All I know to do is increase the maximum memory in Tomcat as well as on the OS, and that didn't help at all.
>> Thanks much!
>> -Ian
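A note on the include patterns above: the share connector's "matching" fields appear to take file-system-style wildcards (* and ?) rather than regular expressions, which would explain why ^(?i)[A-G]* matches almost nothing; it is compared character-for-character, brackets and all. A minimal Java sketch of that interpretation (the wildcard-to-regex translation and per-name matching here are assumptions for illustration, not the connector's actual code):

    // Sketch: treat an include rule as a * / ? wildcard applied to a single
    // file or directory name, the way the connector UI's wording suggests.
    import java.util.regex.Pattern;

    public class IncludeRuleDemo {
        // Translate a wildcard into an equivalent case-insensitive regex.
        static Pattern wildcard(String w) {
            StringBuilder sb = new StringBuilder();
            for (char c : w.toCharArray()) {
                if (c == '*') sb.append(".*");
                else if (c == '?') sb.append('.');
                else sb.append(Pattern.quote(String.valueOf(c)));
            }
            return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
        }

        public static void main(String[] args) {
            // "[A-G]*" is taken literally: a name must start with "[A-G]".
            System.out.println(wildcard("[A-G]*").matcher("Acme Corp").matches()); // false
            // "A*" behaves as hoped for directory names...
            System.out.println(wildcard("A*").matcher("Acme Corp").matches());     // true
            // ...but applied to file names it excludes most files.
            System.out.println(wildcard("A*").matcher("report.pdf").matches());    // false
        }
    }

Under that reading, an A-G job would need separate "directory(s) matching" rules for A*, B*, ... G*, plus a "file(s) matching" rule of *, since character classes are not wildcard syntax and a combined file-or-directory rule of A* also filters out every file whose name does not begin with A.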
