Hi Ian,

If you can connect to your HSQLDB instance, you can simply drop all rows from the table "repohistory". That should make a difference. Of course, it is possible that the database instance is corrupt now and nothing can be done to fix it up.
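For illustration only, a minimal sketch of that cleanup, assuming the database can be reached with a SQL client such as HSQLDB's SqlTool (with the single-process example this generally means stopping ManifoldCF first, since the running process holds a lock on the file database). The defrag/compact statements are the ones Ian mentions below; without one of them the .data file typically does not shrink on disk:

    -- Remove all history rows, as suggested above
    -- (TRUNCATE TABLE repohistory is faster, if your HSQLDB version supports it).
    DELETE FROM repohistory;
    -- Reclaim the space in the .data file.
    CHECKPOINT DEFRAG;
    SHUTDOWN COMPACT;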
Once you get back to a point where queries will work against your HSQLDB instance, only then will the configuration changes that control simple history table bloat work. If you need to recreate everything, I do suggest you do it on PostgreSQL, since it's easier to manage than HSQLDB and is meant for far larger database instances.

Thanks,
Karl

On Fri, Mar 18, 2016 at 9:44 AM, Ian Zapczynski <[email protected]> wrote:
> Karl,
>
> Wow... 100 MB vs. my 32+ GB is certainly perplexing!
>
> I dropped HistoryCleanupInterval in properties.xml to 302400000 ms and have restarted and waited, but I don't see a difference in the .data file size. I tried to connect to HyperSQL directly and run a CHECKPOINT DEFRAG and a SHUTDOWN COMPACT, but I must not be doing these correctly, as the commands came back immediately with no effect whatsoever.
>
> Unless you think otherwise, I feel like I'm now faced with only a few options:
>
> 1) Delete the database and re-run the job to reindex all files. The problem will likely eventually return.
> 2) Upgrade ManifoldCF to a recent release and see if the database magically shrinks. Is there any logical hope in doing this?
> 3) Begin using PostgreSQL instead. This won't tell me what I'm apparently doing wrong, but it will give me more flexibility with database maintenance.
>
> What do you think?
>
> -Ian
>
> >>> Karl Wright <[email protected]> 3/16/2016 2:10 PM >>>
> Hi Ian,
>
> This all looks very straightforward. Typical sizes of an HSQLDB database under this scenario would probably run well under 100 MB. What might be happening, though, is that you might be accumulating a huge history table. This would bloat your database until it falls over (which for HSQLDB is at 32 GB).
>
> History records are used only for generation of reports. Normally MCF out of the box is configured to drop history rows older than a month, but if you are doing lots of crawling and want to stick with HSQLDB, you might want to do it faster than that. There's a properties.xml parameter you can set to control the time interval these records are kept; see the how-to-build-and-deploy page.
>
> Thanks,
> Karl
>
> On Wed, Mar 16, 2016 at 1:05 PM, Ian Zapczynski <[email protected]> wrote:
>> Thanks, Karl.
>>
>> I am using a single Windows shares repository connection to a folder on our file server, which currently contains a total of 143,997 files and 54,424 folders (59.2 GB of total data), of which ManifoldCF seems to identify just over 108,000 as indexable. The job specifies the following:
>> 1. Include indexable file(s) matching *
>> 2. Include directory(s) matching *
>>
>> No custom connectors. I kept this simple because I'm a simple guy. :-) As such, it's entirely possible that I did something stupid when I set it up, but I'm not seeing anything else obvious that seems worth pointing out.
>>
>> -Ian
>>
>> >>> Karl Wright <[email protected]> 3/16/2016 12:03 PM >>>
>> Hi Ian,
>>
>> The database size seems way too big for this crawl size. I've not seen this problem before, but I suspect that whatever is causing the bloat is also causing HSQLDB to fail.
>>
>> Can you give me further details about what repository connections you are using? It is possible that there's a heretofore unknown pathological case you are running into during the crawl. Are there any custom connectors involved?
>>
>> If we rule out a bug of some kind, then the next thing to do would be to go to a real database, e.g. PostgreSQL.
>>
>> Karl
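Both remedies discussed in this thread, shortening simple-history retention and moving the framework to PostgreSQL, are configured in properties.xml. The sketch below is only illustrative; the property names are recalled from the how-to-build-and-deploy page rather than copied from it, so verify them there before use:

    <!-- Keep history rows for roughly 3.5 days (302400000 ms) instead of the default month. -->
    <property name="org.apache.manifoldcf.crawler.historycleanupinterval" value="302400000"/>

    <!-- Or point the framework at PostgreSQL instead of the embedded HSQLDB. -->
    <property name="org.apache.manifoldcf.databaseimplementationclass"
              value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
    <property name="org.apache.manifoldcf.database.name" value="dbname"/>
    <property name="org.apache.manifoldcf.dbsuperusername" value="postgres"/>
    <property name="org.apache.manifoldcf.dbsuperuserpassword" value="postgres"/>

Switching databases means recreating everything and re-running the crawl, as discussed above.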
>>
>> On Wed, Mar 16, 2016 at 11:04 AM, Ian Zapczynski <[email protected]> wrote:
>>> Hello,
>>>
>>> We've had ManifoldCF 2.0.1 working well with SOLR for months on Windows 2012 using the single process model. We recently just noticed that new documents are not getting ingested, even after restarting the job, the server, etc. What I see in the logs are first a bunch of 500 errors coming out of SOLR as a result of ManifoldCF trying to index .tif files that are found in the directory structure being indexed. After that (not sure if related or not), I see a bunch of these errors:
>>>
>>> FATAL 2016-03-15 16:01:48,801 (Thread-1387745) - C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile failed 33337202
>>> org.hsqldb.HsqlException: java.lang.NegativeArraySizeException
>>>   at org.hsqldb.error.Error.error(Unknown Source)
>>>   at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source)
>>>   at org.hsqldb.persist.DataFileCache.get(Unknown Source)
>>>   at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
>>>   at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source)
>>>   at org.hsqldb.index.NodeAVLDisk.getRight(Unknown Source)
>>>   at org.hsqldb.index.IndexAVL.next(Unknown Source)
>>>   at org.hsqldb.index.IndexAVL.next(Unknown Source)
>>>   at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source)
>>>   at org.hsqldb.RangeVariable$RangeIteratorMain.findNext(Unknown Source)
>>>   at org.hsqldb.RangeVariable$RangeIteratorMain.next(Unknown Source)
>>>   at org.hsqldb.QuerySpecification.buildResult(Unknown Source)
>>>   at org.hsqldb.QuerySpecification.getSingleResult(Unknown Source)
>>>   at org.hsqldb.QuerySpecification.getResult(Unknown Source)
>>>   at org.hsqldb.StatementQuery.getResult(Unknown Source)
>>>   at org.hsqldb.StatementDMQL.execute(Unknown Source)
>>>   at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
>>>   at org.hsqldb.Session.execute(Unknown Source)
>>>   at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
>>>   at org.hsqldb.jdbc.JDBCPreparedStatement.executeQuery(Unknown Source)
>>>   at org.apache.manifoldcf.core.database.Database.execute(Database.java:889)
>>>   at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683)
>>> Caused by: java.lang.NegativeArraySizeException
>>>   at org.hsqldb.lib.StringConverter.readUTF(Unknown Source)
>>>   at org.hsqldb.rowio.RowInputBinary.readString(Unknown Source)
>>>   at org.hsqldb.rowio.RowInputBinary.readChar(Unknown Source)
>>>   at org.hsqldb.rowio.RowInputBase.readData(Unknown Source)
>>>   at org.hsqldb.rowio.RowInputBinary.readData(Unknown Source)
>>>   at org.hsqldb.rowio.RowInputBase.readData(Unknown Source)
>>>   at org.hsqldb.rowio.RowInputBinary.readData(Unknown Source)
>>>   at org.hsqldb.rowio.RowInputBinaryDecode.readData(Unknown Source)
>>>   at org.hsqldb.RowAVLDisk.<init>(Unknown Source)
>>>   at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
>>>   ... 21 more
>>>
>>> ERROR 2016-03-15 16:01:48,911 (Stuffer thread) - Stuffer thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.NegativeArraySizeException
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (S1000): java.lang.NegativeArraySizeException
>>>   at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
>>>   at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>>>   at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
>>>   at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>>>   at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>>>   at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>>>   at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performQuery(DBInterfaceHSQLDB.java:916)
>>>   at org.apache.manifoldcf.core.database.BaseTable.performQuery(BaseTable.java:221)
>>>   at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getPipelineDocumentIngestDataChunk(IncrementalIngester.java:1783)
>>>   at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getPipelineDocumentIngestDataMultiple(IncrementalIngester.java:1748)
>>>   at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getPipelineDocumentIngestDataMultiple(IncrementalIngester.java:1703)
>>>   at org.apache.manifoldcf.crawler.system.StufferThread.run(StufferThread.java:254)
>>> Caused by: java.sql.SQLException: java.lang.NegativeArraySizeException
>>>   at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>>>   at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>>>   at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
>>>   at org.hsqldb.jdbc.JDBCPreparedStatement.executeQuery(Unknown Source)
>>>   at org.apache.manifoldcf.core.database.Database.execute(Database.java:889)
>>>   at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683)
>>> Caused by: org.hsqldb.HsqlException: java.lang.NegativeArraySizeException
>>>
>>> After these errors occur, the job just seems to hang and not process any further documents or log anything more in the manifoldcf.log. So I see the error is coming out of the HyperSQL database, but I don't know why. There is sufficient disk space. Now the database file is 33 GB (larger than I'd expect for our ~110,000 documents), but I haven't seen any evidence that we're hitting a limit on file size. I'm afraid I'm not sure where to go from here to further nail down the problem.
>>>
>>> As always, any and all help is much appreciated.
>>>
>>> Thanks,
>>> -Ian
