I mostly get these repeated many, many times over:
 
ERROR - 2015-03-24 14:48:40.321; org.apache.solr.common.SolrException; 
null:org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.pdf.PDFParser@46a9acab
 at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
<snip>
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.pdf.PDFParser@46a9acab
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
<snip>
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of 
range: 0
 at java.lang.String.charAt(String.java:646)
 
Then of course I get some of these, which are expected when we have encrypted 
or password-protected files:
 
ERROR - 2015-03-24 14:48:40.962; org.apache.solr.common.SolrException; 
org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: 
Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.OfficeParser@38902a7
 at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
 <snip>
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.microsoft.OfficeParser@38902a7
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
<snip>
Caused by: org.apache.poi.EncryptedDocumentException: Cannot process encrypted 
word file
 
 
 

>>> Karl Wright <[email protected]> 3/24/2015 11:22 AM >>>
" failure processing document Server athttp://localhost:8983/solr returned non 
ok status:500, message:Server Error""


That's an error occurring on Solr. What do the Solr logs say?


Karl



On Tue, Mar 24, 2015 at 11:11 AM, Ian Zapczynski 
<[email protected]> wrote:


Unfortunately I'm still getting stuck indexing. It appears that I have such a 
large number of password-protected docs and scanned PDFs without OCR that the 
job is dying before it even finds all the "good" docs. It will die with the 
error "Error: Repeated service interruptions - failure processing document 
Server at http://localhost:8983/solr returned non ok status:500, message:Server 
Error". My job tells me there are 168,595 documents, with 73,238 currently 
active and 106,291 processed. At this point, if I keep restarting the job, it 
slowly adds a small number of new docs, but then dies again with the same 
error. The thing is, it has not indexed a large number of documents that should 
be indexable. 
To clarify, I am indexing a parent folder that contains thousands of folders 
named after companies we have associated with, each with several files and 
folders inside. If I test searches in Solr by reviewing a document and then 
searching on text from it, I consistently get the expected results for files 
within folders whose company names begin with A, B, C, D, etc. However, I do 
not get results for files within folders whose company names begin with R, S, 
T, U, V, etc. 
We are looking into whether we should batch-OCR the scanned PDFs and thereby 
cut down on the number of problem docs, but for now I'd just like to get all of 
the indexable documents into Solr. 
Going back to my original question, should I consider breaking this single job 
into multiple jobs based on the letter of the alphabet? If so, I haven't been 
able to figure out a working regex to tell it to pick up all files and folders 
within a folder whose name begins with R-Z, for example. And if not that 
workaround, where do you suggest I look to resolve this? I'm not entirely sure 
what is causing all of the files and folders not to be traversed before my job 
dies (is this a ManifoldCF thing or a Solr thing?).
Thanks again for your help.


>>> Ian Zapczynski 3/20/2015 2:25 PM >>>
Thanks for the help, Karl. Yup, I was using the simple-to-set-up single-process 
configuration, and silly me: after I restarted from scratch at one point, I 
completely failed to update the combined-options-env.win config file that you 
referred to, so MCF was still set to use only 256 MB despite my thinking 
otherwise. I've bumped it up to 4 GB, and the job recovered and is finally 
moving along again. 
-Ian

>>> Karl Wright <[email protected]> 3/20/2015 10:55 AM >>>
Hi Ian, 

HSQLDB is an interesting database in that it does *not* constrain its own 
memory use: it attempts to keep everything in memory.

If you want to keep using HSQLDB, I'd strongly suggest giving the MCF agents 
process a lot more memory, say 2 GB. A better choice would be PostgreSQL or 
MySQL. There's a configuration file where you can put Java switches for all of 
the processes; start by doing that.
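
As a rough sketch only (the exact file name and current defaults depend on 
which example you started from), the entries in that file are just ordinary 
JVM switches, one per line, so raising the heap would look something like:

  -Xms1024m
  -Xmx2048m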

Thanks,
Karl




On Fri, Mar 20, 2015 at 9:29 AM, Ian Zapczynski 
<[email protected]> wrote:


Hi Karl,
I have Solr and ManifoldCF running with Tomcat on a Windows 2012 R2 server. 
Linux would have been my preference, but various logistics prevented me from 
using it. I have set the maximum document length to 3072000 bytes (about 3 MB). 
I chose a larger size than might be typical because when I first ran a test, I 
could see that a lot of docs were getting rejected based on size, and it seems 
folks around here don't reduce/shrink the size of their PDFs. 
The errors from the log are below. I had been paying more attention to the 
errors spit out to the console, which didn't so obviously point to the backend 
database as the culprit. I'm guessing that I'm pushing the database too hard 
and should really be using PostgreSQL, right? I don't know why, but I hadn't 
looked at the deployment documentation covering the various other databases 
until now. I was working off of the ManifoldCF End User Documentation as well 
as a (mostly) helpful blog post I found elsewhere. 
Much thanks,
-Ian
WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Solr exception during 
indexing 
file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file1.pdf
 (500): Server at http://localhost:8983/solr returned non ok status:500, 
message:Server Error
org.apache.solr.common.SolrException: Server at http://localhost:8983/solr 
returned non ok status:500, message:Server Error
at 
org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:303)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at 
org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:894)
WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Service interruption 
reported for job 1426796577848 connection 'MACLSTR file server': Solr exception 
during indexing 
file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file3.pdf
 (500): Server at http://localhost:8983/solr returned non ok status:500, 
message:Server Error
ERROR 2015-03-19 18:31:45,730 (Job delete thread) - Job delete thread aborting 
and restarting due to database connection reset: Database exception: 
SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit 
exceeded
ERROR 2015-03-19 18:31:45,309 (Finisher thread) - Finisher thread aborting and 
restarting due to database connection reset: Database exception: SQLException 
doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
ERROR 2015-03-19 18:31:43,043 (Set priority thread) - Set priority thread 
aborting and restarting due to database connection reset: Database exception: 
SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit 
exceeded
ERROR 2015-03-19 18:32:02,292 (Job notification thread) - Job notification 
thread aborting and restarting due to database connection reset: Database 
exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC 
overhead limit exceeded
FATAL 2015-03-19 18:32:05,870 (Thread-3838608) - 
C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
WARN 2015-03-19 18:32:09,167 (Assessment thread) - Found a long-running query 
(64919 ms): [SELECT id,status,connectionname FROM jobs WHERE assessmentstate=? 
FOR UPDATE]
WARN 2015-03-19 18:32:09,167 (Assessment thread) - Parameter 0: 'N'
ERROR 2015-03-19 18:32:09,167 (Job reset thread) - Job reset thread aborting 
and restarting due to database connection reset: Database exception: 
SQLException doing query (S1000): java.lang.RuntimeException: Logging failed 
when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data 
getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when 
attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data 
getFromFile out of mem 531146
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: 
SQLException doing query (S1000): java.lang.RuntimeException: Logging failed 
when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data 
getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when 
attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data 
getFromFile out of mem 531146
at 
org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
at 
org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
at 
org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
at 
org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
at 
org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
at 
org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:750)
at 
org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performUpdate(DBInterfaceHSQLDB.java:296)
at 
org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
at 
org.apache.manifoldcf.crawler.jobs.JobQueue.noDocPriorities(JobQueue.java:967)
at 
org.apache.manifoldcf.crawler.jobs.JobManager.noDocPriorities(JobManager.java:8148)
at 
org.apache.manifoldcf.crawler.jobs.JobManager.finishJobStops(JobManager.java:8123)
at 
org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:69)
Caused by: java.sql.SQLException: java.lang.RuntimeException: Logging failed 
when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data 
getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when 
attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data 
getFromFile out of mem 531146
at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)
at org.apache.manifoldcf.core.database.Database.execute(Database.java:903)
at 
org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683)
Caused by: org.hsqldb.HsqlException: java.lang.RuntimeException: Logging failed 
when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data 
getFromFile out of mem 531146
at org.hsqldb.error.Error.error(Unknown Source)
at org.hsqldb.result.Result.newErrorResult(Unknown Source)
at org.hsqldb.StatementDMQL.execute(Unknown Source)
at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
at org.hsqldb.Session.execute(Unknown Source)
... 4 more
Caused by: java.lang.RuntimeException: Logging failed when attempting to log: 
C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
at org.hsqldb.lib.FrameworkLogger.privlog(Unknown Source)
at org.hsqldb.lib.FrameworkLogger.severe(Unknown Source)
at org.hsqldb.persist.Logger.logSevereEvent(Unknown Source)
at org.hsqldb.persist.DataFileCache.logSevereEvent(Unknown Source)
at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source)
at org.hsqldb.persist.DataFileCache.get(Unknown Source)
at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source)
at org.hsqldb.index.NodeAVLDisk.getRight(Unknown Source)
at org.hsqldb.index.IndexAVL.next(Unknown Source)
at org.hsqldb.index.IndexAVL.next(Unknown Source)
at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source)
at org.hsqldb.RangeVariable$RangeIteratorMain.findNext(Unknown Source)
at org.hsqldb.RangeVariable$RangeIteratorMain.next(Unknown Source)
at org.hsqldb.StatementDML.executeUpdateStatement(Unknown Source)
at org.hsqldb.StatementDML.getResult(Unknown Source)
... 7 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
... 23 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
WARN 2015-03-19 18:32:09,167 (Assessment thread) - Plan: 
isDistinctSelect=[false]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isGrouped=[false]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isAggregated=[false]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: columns=[ COLUMN: 
PUBLIC.JOBS.ID not nullable
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: 
PUBLIC.JOBS.STATUS not nullable
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: 
PUBLIC.JOBS.CONNECTIONNAME not nullable
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: 
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: [range variable 1
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join type=INNER
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: table=JOBS
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: cardinality=5
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: access=FULL SCAN
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join condition = 
[index=SYS_IDX_SYS_PK_10234_10237
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: other condition=[
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: EQUAL arg_left=[ 
COLUMN: PUBLIC.JOBS.ASSESSMENTSTATE
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ] arg_right=[ DYNAMIC 
PARAM: , TYPE = CHARACTER
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: PARAMETERS=[
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: @0[DYNAMIC PARAM: , 
TYPE = CHARACTER
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: SUBQUERIES[]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - 
FATAL 2015-03-19 18:32:09,198 (Job notification thread) - JobNotificationThread 
initialization error tossed: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
FATAL 2015-03-19 18:32:09,198 (Set priority thread) - SetPriorityThread 
initialization error tossed: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
FATAL 2015-03-19 18:32:09,198 (Finisher thread) - FinisherThread initialization 
error tossed: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
FATAL 2015-03-19 18:32:09,198 (Job delete thread) - JobDeleteThread 
initialization error tossed: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
FATAL 2015-03-19 18:32:09,198 (Seeding thread) - SeedingThread initialization 
error tossed: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

>>> Karl Wright <[email protected]> 3/19/2015 3:34 PM >>>
Hi Ian, 

ManifoldCF operates under what is known as a "bounded" memory model. That means 
that you should always be able to find a memory size that works (that isn't 
huge).

The only exception to this is Solr indexing that does *not* go via the 
extracting update handler. The standard update handler unfortunately *requires* 
that the entire document fit in memory. If this is what you are doing, you must 
take steps to limit the maximum document size to prevent OOMs.
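
To illustrate the difference, here is a minimal SolrJ sketch (this is *not* the 
ManifoldCF connector code itself, and the core URL, file name, and field names 
are only placeholders):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class HandlerComparison {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    // Extracting update handler: the raw file is streamed to /update/extract
    // and Tika parses it on the Solr side, so the client never has to hold
    // the whole document body in memory.
    ContentStreamUpdateRequest extract =
        new ContentStreamUpdateRequest("/update/extract");
    extract.addFile(new File("sample.pdf"), "application/pdf");
    extract.setParam("literal.id", "doc-1");
    extract.process(solr);

    // Standard update handler: the extracted text must be built up in memory
    // as a field value before it is sent, which is why very large documents
    // have to be capped to avoid OOMs.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-2");
    doc.addField("content", "...the full extracted text, held in memory...");
    solr.add(doc);
    solr.commit();
  }
}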

160,000 documents is quite small by MCF standards (we do 10 million to 50 
million on some setups). So let's diagnose your problem before taking any 
bizarre actions. Can you provide an out-of-memory dump from the log, for 
instance? Can you let us know what deployment model you are using (e.g. 
single-process, etc.)?

Thanks,
Karl


On Thu, Mar 19, 2015 at 3:07 PM, Ian Zapczynski 
<[email protected]> wrote:


Hello all. I am using ManifoldCF to index a Windows share containing well over 
160,000 files (.xls, .pdf, .doc). I keep getting memory errors when I try to 
index the whole folder at once and have not been able to resolve this by 
throwing memory and CPU at Tomcat and the VM, so I thought I'd try this a 
different way.
What I'd like to do now is break what was a single job up into multiple jobs. 
Each job should index all indexable files under the parent folder, with one job 
indexing folders whose names begin with the letters A-G (plus all subfolders 
and files within), another job for H-M with its subfolders/files, and so on. My 
problem is that I can't figure out what expression to use to get it to index 
what I want. 
In the Job settings under Paths, I have specified the parent folder, and within 
there I've tried:
1. Include file(s) or directory(s) matching * (this works, but indexes every 
file in every folder within the parent, eventually causing me unresolvable GC 
memory overhead errors)
2. Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not work; it 
supposedly indexes one file and then quits)
3. Include file(s) or directory(s) matching A* (this does not work; it 
supposedly indexes one file and then quits, and there are many folders directly 
under the parent that begin with 'A')
Can anyone help confirm what type of expression I should use in the paths to 
accomplish what I want? 
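For what it's worth, here is how patterns 2 and 3 behave when checked as plain, 
fully-anchored Java regexes outside of ManifoldCF. I don't actually know 
whether the path rules are interpreted this way at all, so please treat this 
only as a guess at what I might be doing wrong:

import java.util.regex.Pattern;

public class PathPatternCheck {
  public static void main(String[] args) {
    String folder = "Acme Corporation"; // a made-up folder name under the parent
    // Anchored against the whole name, [A-G]* only matches strings made up
    // entirely of the letters A-G (including the empty string), so a real
    // company folder name does not match.
    System.out.println(Pattern.matches("(?i)[A-G]*", folder));  // false
    // Adding ".*" makes it "starts with A-G, followed by anything", which is
    // what I actually intend.
    System.out.println(Pattern.matches("(?i)[A-G].*", folder)); // true
  }
}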
Or alternatively, if you think I should be able to index 160,000+ files in one 
job without getting GC overhead errors, I'm open to hearing your suggestions on 
resolving those. All I know to do is increase the maximum memory in Tomcat as 
well as on the OS, and that didn't help at all. 
Thanks much!


-Ian





