Hi Ian,

The TikaException means that Tika is failing to parse some of your documents. You can configure Solr to ignore Tika exceptions so that a single unparseable document doesn't fail the whole request. The documentation should cover this; there is also a rough solrconfig.xml sketch at the bottom of this mail, below the quoted thread:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Karl

On Tue, Mar 24, 2015 at 11:50 AM, Ian Zapczynski <[email protected]> wrote:

> I mostly get these repeated many, many times over:
>
> ERROR - 2015-03-24 14:48:40.321; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@46a9acab
> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> <snip>
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@46a9acab
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> <snip>
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
> at java.lang.String.charAt(String.java:646)
>
> Then of course I get some of these, which are expected when we have encrypted or password-protected files:
>
> ERROR - 2015-03-24 14:48:40.962; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@38902a7
> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> <snip>
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@38902a7
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> <snip>
> Caused by: org.apache.poi.EncryptedDocumentException: Cannot process encrypted word file
>
> >>> Karl Wright <[email protected]> 3/24/2015 11:22 AM >>>
> "failure processing document Server at http://localhost:8983/solr returned non ok status:500, message:Server Error"
>
> That's an error occurring on Solr. What do the Solr logs say?
>
> Karl
>
> On Tue, Mar 24, 2015 at 11:11 AM, Ian Zapczynski <[email protected]> wrote:
>
>> Unfortunately I'm still getting stuck indexing. It appears that I have such a large number of password-protected docs and scanned PDFs without OCR that the job is dying before it even finds all the "good" docs. It dies with the error "Error: Repeated service interruptions - failure processing document Server at http://localhost:8983/solr returned non ok status:500, message:Server Error". My job tells me there are 168,595 documents, with 73,238 currently active and 106,291 processed. At this point, if I keep restarting the job, it slowly adds a small number of new docs, but then dies again with the same error. The thing is, it has not indexed a large number of documents that should be indexable.
>> To help clarify, I am indexing a folder that contains thousands of subfolders named after companies we are associated with, each with several files and folders inside. If I test searches in SOLR by reviewing a document and then searching on text from it, I consistently get the expected results for files inside folders whose company names begin with A, B, C, D, etc. However, I do not get results for files inside folders whose names begin with R, S, T, U, V, etc.
>> We are looking into whether we should batch-convert the scanned PDFs for OCR and thereby cut down on the number of problem docs, but for now I'd just like to get all of the indexable documents into SOLR.
>> Going back to my original question, should I consider breaking this single job into multiple jobs based on the first letter of the alphabet? If so, I haven't been able to figure out a working expression to pick up all files and folders under a folder whose name begins with R-Z, for example. And if not that workaround, where do you suggest I look to resolve this? I'm not entirely sure why all of the files and folders are not being traversed before my job dies (is this a ManifoldCF thing or a SOLR thing?)
>> Thanks again for your help.
>>
>> >>> Ian Zapczynski 3/20/2015 2:25 PM >>>
>> Thanks for the help, Karl. Yup, I was using the simple-to-set-up single-process configuration, and silly me... after I restarted from scratch at one point, I completely failed to update the combined-options-env.win config file that you referred to, so MCF was still set to use only 256 MB despite my thinking otherwise. I've bumped it up to 4 GB, and the job recovered and is finally moving along again.
>> -Ian
>>
>> >>> Karl Wright <[email protected]> 3/20/2015 10:55 AM >>>
>> Hi Ian,
>>
>> HSQLDB is an interesting database in that it is *not* memory constrained: it attempts to keep everything in memory.
>>
>> I'd strongly suggest giving the MCF agents process a lot more memory, say 2G, if you want to keep using HSQLDB. A better choice would be PostgreSQL or MySQL. There's a configuration file where you can put java switches for all of the processes; start by doing that.
>>
>> Thanks,
>> Karl
>>
>> On Fri, Mar 20, 2015 at 9:29 AM, Ian Zapczynski <[email protected]> wrote:
>>
>>> Hi Karl,
>>> I have SOLR and ManifoldCF running with Tomcat on a Windows 2012 R2 server. Linux would have been my preference, but various logistics prevented me from using it. I have set the maximum document length to 3072000. I chose a larger size than might be normal because when I first did a test, I could see that a lot of docs were getting rejected based on size, and it seems folks around here don't reduce the size of their PDFs.
>>> The errors from the log are below. I had been paying more attention to the errors spit out to the console, which didn't so obviously point to the backend database being the culprit. I'm guessing that I'm pushing the database too hard and should really be using PostgreSQL, right? I don't know why, but I didn't see the deployment documentation that covers using other databases until now; I was working off of the ManifoldCF End User Documentation as well as a (mostly) helpful blog post I found elsewhere.
>>> Much thanks,
>>> -Ian
>>>
>>> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file1.pdf (500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>>> org.apache.solr.common.SolrException: Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>>> at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:303)
>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>>> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>>> at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:894)
>>> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Service interruption reported for job 1426796577848 connection 'MACLSTR file server': Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file3.pdf (500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>>> ERROR 2015-03-19 18:31:45,730 (Job delete thread) - Job delete thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> ERROR 2015-03-19 18:31:45,309 (Finisher thread) - Finisher thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> ERROR 2015-03-19 18:31:43,043 (Set priority thread) - Set priority thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> ERROR 2015-03-19 18:32:02,292 (Job notification thread) - Job notification thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> FATAL 2015-03-19 18:32:05,870 (Thread-3838608) - C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Found a long-running query (64919 ms): [SELECT id,status,connectionname FROM jobs WHERE assessmentstate=? FOR UPDATE]
>>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Parameter 0: 'N'
>>> ERROR 2015-03-19 18:32:09,167 (Job reset thread) - Job reset thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
>>> at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>>> at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
>>> at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>>> at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>>> at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>>> at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:750)
>>> at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performUpdate(DBInterfaceHSQLDB.java:296)
>>> at org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
>>> at org.apache.manifoldcf.crawler.jobs.JobQueue.noDocPriorities(JobQueue.java:967)
>>> at org.apache.manifoldcf.crawler.jobs.JobManager.noDocPriorities(JobManager.java:8148)
>>> at org.apache.manifoldcf.crawler.jobs.JobManager.finishJobStops(JobManager.java:8123)
>>> at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:69)
>>> Caused by: java.sql.SQLException: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>>> at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>>> at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
>>> at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)
>>> at org.apache.manifoldcf.core.database.Database.execute(Database.java:903)
>>> at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683)
>>> Caused by: org.hsqldb.HsqlException: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> at org.hsqldb.error.Error.error(Unknown Source)
>>> at org.hsqldb.result.Result.newErrorResult(Unknown Source)
>>> at org.hsqldb.StatementDMQL.execute(Unknown Source)
>>> at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
>>> at org.hsqldb.Session.execute(Unknown Source)
>>> ... 4 more
>>> Caused by: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>> at org.hsqldb.lib.FrameworkLogger.privlog(Unknown Source)
>>> at org.hsqldb.lib.FrameworkLogger.severe(Unknown Source)
>>> at org.hsqldb.persist.Logger.logSevereEvent(Unknown Source)
>>> at org.hsqldb.persist.DataFileCache.logSevereEvent(Unknown Source)
>>> at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source)
>>> at org.hsqldb.persist.DataFileCache.get(Unknown Source)
>>> at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
>>> at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source)
>>> at org.hsqldb.index.NodeAVLDisk.getRight(Unknown Source)
>>> at org.hsqldb.index.IndexAVL.next(Unknown Source)
>>> at org.hsqldb.index.IndexAVL.next(Unknown Source)
>>> at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source)
>>> at org.hsqldb.RangeVariable$RangeIteratorMain.findNext(Unknown Source)
>>> at org.hsqldb.RangeVariable$RangeIteratorMain.next(Unknown Source)
>>> at org.hsqldb.StatementDML.executeUpdateStatement(Unknown Source)
>>> at org.hsqldb.StatementDML.getResult(Unknown Source)
>>> ... 7 more
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:483)
>>> ... 23 more
>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Plan: isDistinctSelect=[false]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isGrouped=[false]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isAggregated=[false]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: columns=[ COLUMN: PUBLIC.JOBS.ID not nullable
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.STATUS not nullable
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.CONNECTIONNAME not nullable
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan:
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: [range variable 1
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join type=INNER
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: table=JOBS
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: cardinality=5
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: access=FULL SCAN
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join condition = [index=SYS_IDX_SYS_PK_10234_10237
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: other condition=[
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: EQUAL arg_left=[ COLUMN: PUBLIC.JOBS.ASSESSMENTSTATE
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ] arg_right=[ DYNAMIC PARAM: , TYPE = CHARACTER
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: PARAMETERS=[
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: @0[DYNAMIC PARAM: , TYPE = CHARACTER
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: SUBQUERIES[]
>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) -
>>> FATAL 2015-03-19 18:32:09,198 (Job notification thread) - JobNotificationThread initialization error tossed: GC overhead limit exceeded
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> FATAL 2015-03-19 18:32:09,198 (Set priority thread) - SetPriorityThread initialization error tossed: GC overhead limit exceeded
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> FATAL 2015-03-19 18:32:09,198 (Finisher thread) - FinisherThread initialization error tossed: GC overhead limit exceeded
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> FATAL 2015-03-19 18:32:09,198 (Job delete thread) - JobDeleteThread initialization error tossed: GC overhead limit exceeded
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> FATAL 2015-03-19 18:32:09,198 (Seeding thread) - SeedingThread initialization error tossed: GC overhead limit exceeded
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> >>> Karl Wright <[email protected]> 3/19/2015 3:34 PM >>>
>>> Hi Ian,
>>>
>>> ManifoldCF operates under what is known as a "bounded" memory model. That means that you should always be able to find a memory size that works (that isn't huge).
>>>
>>> The only exception to this is for Solr indexing that does *not* go via the extracting update handler. The standard update handler unfortunately *requires* that the entire document fit in memory. If this is what you are doing, you must take steps to limit the maximum document size to prevent OOM's.
>>>
>>> 160,000 documents is quite small by MCF standards (we do 10 million to 50 million on some setups). So let's diagnose your problem before taking any bizarre actions. Can you provide an out-of-memory dump from the log, for instance? Can you let us know what deployment model you are using (e.g. single-process, etc.)?
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Thu, Mar 19, 2015 at 3:07 PM, Ian Zapczynski <[email protected]> wrote:
>>>
>>>> Hello all. I am using ManifoldCF to index a Windows share containing well over 160,000 files (.xls, .pdf, .doc). I keep getting memory errors when I try to index the whole folder at once and have not been able to resolve this by throwing memory and CPU at Tomcat and the VM, so I thought I'd try this a different way.
>>>> What I'd like to do now is break what was a single job up into multiple jobs. Each job should index all indexable files under a parent folder, with one job indexing folders whose names begin with the letters A-G as well as all subfolders and files within, another job for H-M also with all subfolders/files, and so on. My problem is, somehow I can't manage to figure out what expression to use to get it to index what I want.
>>>> In the Job settings under Paths, I have specified the parent folder, and within there I've tried:
>>>> 1. Include file(s) or directory(s) matching * (this works, but indexes every file in every folder within the parent, eventually causing me unresolvable GC memory overhead errors)
>>>> 2. Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not work; it supposedly indexes one file and then quits)
>>>> 3. Include file(s) or directory(s) matching A* (this does not work; it supposedly indexes one file and then quits, and there are many folders directly under the parent that begin with 'A')
>>>> Can anyone help confirm what type of expression I should use in the paths to accomplish what I want?
>>>> Or alternately, if you think I should be able to index 160,000+ files in one job without getting GC memory overhead errors, I'm open to hear your suggestions on resolving those. All I know to do is increase the maximum memory in Tomcat as well as on the OS, and that didn't help at all.
>>>> Thanks much!
>>>> -Ian
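For the Tika failures specifically: if your Solr version supports it, the extracting handler accepts an ignoreTikaException parameter that tells it to index whatever it can rather than returning a 500 for the whole request. The snippet below is only a rough sketch, not your actual config: merge the one parameter into the /update/extract handler already in your solrconfig.xml, and treat the other defaults shown (lowernames, uprefix) as placeholders for whatever you already have there.

  <requestHandler name="/update/extract"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <!-- Index what Tika can extract instead of failing the request
           when the parser throws (e.g. the encrypted Word files). -->
      <str name="ignoreTikaException">true</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
    </lst>
  </requestHandler>

With that in place, documents Tika cannot parse should still be indexed with whatever metadata is available instead of killing the batch.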
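On the earlier out-of-memory errors in the quoted thread: with the single-process example, the agents process gets its heap from the JVM switches in the options file you mentioned (combined-options-env.win in your setup). A minimal sketch, assuming a 64-bit JVM with physical RAM to spare, and keeping whatever other switches are already in that file:

  -Xms1024m
  -Xmx4096m

That only buys headroom for HSQLDB, which keeps its working set in memory; for a crawl of this size, moving the back-end database to PostgreSQL (covered in the deployment documentation you referred to) is the more durable fix.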
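And on your earlier question about splitting the crawl into A-G / H-M / R-Z jobs: I won't guess here at exactly what the Windows share connector matches the "Include file(s) or directory(s) matching" expression against (bare folder name versus full path), so it is worth sanity-checking a candidate pattern against real folder names before wiring it into a job. A throwaway harness like the following (the sample names are made up) at least confirms the character-class syntax does what you intend:

  import java.util.Arrays;
  import java.util.List;
  import java.util.regex.Pattern;

  public class FolderPatternCheck {
      public static void main(String[] args) {
          // Candidate: names whose first letter is R through Z, case-insensitive.
          Pattern rToZ = Pattern.compile("^[R-Z].*", Pattern.CASE_INSENSITIVE);

          // Hypothetical folder names; substitute real ones from the share.
          List<String> samples = Arrays.asList("Acme Corp", "Redwood Partners", "zeta holdings");

          for (String name : samples) {
              System.out.printf("%-20s -> %b%n", name, rToZ.matcher(name).matches());
          }
      }
  }

If the connector turns out to match full paths rather than names, the same harness makes it easy to adjust the pattern and re-test before re-running the job.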
