Just to clarify, it looks like the primary problem is this: >>>>>> Caused by: org.apache.poi.EncryptedDocumentException: Cannot process encrypted word file <<<<<<
That's because Solr is not going to be able to index encrypted file contents no matter what you do. The best you can hope for is that there are not too many documents of this kind, and to skip the ones where it happens. Turning off Tika exceptions, as described above, will do that.
Thanks,
Karl

On Tue, Mar 24, 2015 at 12:16 PM, Karl Wright <[email protected]> wrote:
> Hi,
>
> The TikaException issue means that Tika is having trouble processing your document. You can set up Solr to disable Tika exceptions. You can probably find the documentation here:
>
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>
> Karl
>
> On Tue, Mar 24, 2015 at 11:50 AM, Ian Zapczynski <[email protected]> wrote:
>
>> I mostly get these repeated many, many times over:
>>
>> ERROR - 2015-03-24 14:48:40.321; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@46a9acab
>> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>> <snip>
>> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@46a9acab
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>> <snip>
>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
>> at java.lang.String.charAt(String.java:646)
>>
>> Then of course I get some of these, which are expected when we have encrypted or password-protected files:
>>
>> ERROR - 2015-03-24 14:48:40.962; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@38902a7
>> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>> <snip>
>> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@38902a7
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>> <snip>
>> Caused by: org.apache.poi.EncryptedDocumentException: Cannot process encrypted word file
>>
>> >>> Karl Wright <[email protected]> 3/24/2015 11:22 AM >>>
>> "failure processing document Server at http://localhost:8983/solr returned non ok status:500, message:Server Error"
>>
>> That's an error occurring on Solr. What do the Solr logs say?
>>
>> Karl
>>
>> On Tue, Mar 24, 2015 at 11:11 AM, Ian Zapczynski <[email protected]> wrote:
>>
>>> Unfortunately I'm still getting stuck indexing. It at least appears to me that I have such a large number of password-protected docs and scanned PDFs without OCR enabled that the job is dying on me before it even finds all the "good" docs. It will die with the error "Error: Repeated service interruptions - failure processing document Server at http://localhost:8983/solr returned non ok status:500, message:Server Error". My job tells me there are 168,595 documents, with 73,238 currently active, and 106,291 processed. At this point, if I keep restarting the job, it slowly adds a small number of new docs, but then dies again with the same error. The thing is, it has not indexed a large number of documents that should be indexable.
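The "disable Tika exceptions" setting Karl describes at the top of the thread lives on the Solr side, in the extracting request handler's defaults in solrconfig.xml. The snippet below is a minimal sketch, assuming a Solr 4.x Solr Cell setup that supports the ignoreTikaException parameter; the handler name and the field mappings are placeholders to be adjusted to the actual schema.

    <requestHandler name="/update/extract" startup="lazy"
                    class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <!-- placeholder field mappings; adjust to the real schema -->
        <str name="fmap.content">text</str>
        <str name="uprefix">ignored_</str>
        <!-- skip documents Tika cannot parse (encrypted, corrupt, etc.)
             instead of failing the whole update request with a 500 -->
        <str name="ignoreTikaException">true</str>
      </lst>
    </requestHandler>

With something like this in place, encrypted Word files and unparsable PDFs should be skipped (or indexed with whatever metadata was extracted) rather than returning the 500 errors that keep killing the crawl.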
>>> To help clarify: I am indexing a parent folder that contains thousands of subfolders named after companies we have worked with, each with several files and folders inside. If I test searches in SOLR by reviewing a document and then performing a search based on that text, I consistently get the expected results when I search files within folders whose company names begin with A, B, C, D, etc. However, I do not get results if I search for files within folders whose company names begin with R, S, T, U, V, etc.
>>> We are looking into whether we should batch-convert the scanned PDFs through OCR and thereby cut down on the number of problem docs, but for now, I'd like to just get all of the indexable documents into SOLR.
>>> Going back to my original question, should I consider breaking this single job into multiple jobs based on the letter of the alphabet? If so, I haven't been able to figure out a working regex to tell it to just pick up all files and folders within a folder whose name begins with R-Z, for example. And if not that workaround, where do you suggest I go to resolve this? I'm not entirely sure what is causing all of the files and folders not to be traversed before my job dies (is this a ManifoldCF thing or a SOLR thing?).
>>> Thanks again for your help.
>>>
>>> Ian Zapczynski 3/20/2015 2:25 PM >>>
>>> Thanks for the help, Karl. Yup, I was using the simple-to-set-up single-process configuration, and silly me... after I restarted from scratch at one point, I completely failed to update the combined-options-env.win config file that you referred to, so MCF was still set to use only 256 MB despite my thinking otherwise. I've bumped it up to 4 GB, and the job recovered and is finally moving along again.
>>> -Ian
>>>
>>> Karl Wright <[email protected]> 3/20/2015 10:55 AM >>>
>>> Hi Ian,
>>>
>>> HSQLDB is an interesting database in that its memory use is *not* constrained: it attempts to keep everything in memory.
>>>
>>> If you want to keep using hsqldb, I'd strongly suggest giving the MCF agents process a lot more memory, say 2G. A better choice, though, would be postgresql or mysql. There's a configuration file where you can put java switches for all of the processes; start by doing that.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Fri, Mar 20, 2015 at 9:29 AM, Ian Zapczynski <[email protected]> wrote:
>>>
>>>> Hi Karl,
>>>> I have SOLR and ManifoldCF running with Tomcat on a Windows 2012 R2 server. Linux would have been my preference, but various logistics prevented me from using that. I have set the maximum document length to 3072000. I chose a larger size than might be normal because when I first did a test, I could see that a lot of docs were getting rejected based on size, and it seems folks around here don't reduce/shrink the size of their PDFs.
>>>> The errors from the log are below. I had been paying more attention to the errors spit out to the console, which didn't so obviously point to the backend database as the culprit. I'm guessing that I'm pushing the database too hard and should really be using PostgreSQL, right? I don't know why, but I didn't see or reference the deployment documentation that covered using various other databases until now.
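On the memory settings discussed just above: for the single-process example, the java switches Karl mentions go into the options file Ian names (the exact file name varies by ManifoldCF version and deployment). A minimal sketch, assuming the file takes one JVM switch per line as the example's options files do, using the 4 GB ceiling Ian settled on (the -Xms value here is arbitrary):

    -Xms1024m
    -Xmx4096m

Since HSQLDB tries to keep everything in memory, the heap needed grows with the size of the crawl, which is why the more durable fix is Karl's suggestion to switch to PostgreSQL or MySQL.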
I was working off of the >>>> ManifoldCF End User Documentation as well as a (mostly) helpful blog post I >>>> found elsewhere. >>>> Much thanks, >>>> -Ian >>>> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Solr exception >>>> during indexing >>>> file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file1.pdf >>>> (500): Server at http://localhost:8983/solr returned non ok >>>> status:500, message:Server Error >>>> org.apache.solr.common.SolrException: Server at >>>> http://localhost:8983/solr returned non ok status:500, message:Server >>>> Error >>>> at >>>> org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:303) >>>> at >>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206) >>>> at >>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124) >>>> at >>>> org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:894) >>>> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Service >>>> interruption reported for job 1426796577848 connection 'MACLSTR file >>>> server': Solr exception during indexing >>>> file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file3.pdf >>>> (500): Server at http://localhost:8983/solr returned non ok >>>> status:500, message:Server Error >>>> ERROR 2015-03-19 18:31:45,730 (Job delete thread) - Job delete thread >>>> aborting and restarting due to database connection reset: Database >>>> exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC >>>> overhead limit exceeded >>>> ERROR 2015-03-19 18:31:45,309 (Finisher thread) - Finisher thread >>>> aborting and restarting due to database connection reset: Database >>>> exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC >>>> overhead limit exceeded >>>> ERROR 2015-03-19 18:31:43,043 (Set priority thread) - Set priority >>>> thread aborting and restarting due to database connection reset: Database >>>> exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC >>>> overhead limit exceeded >>>> ERROR 2015-03-19 18:32:02,292 (Job notification thread) - Job >>>> notification thread aborting and restarting due to database connection >>>> reset: Database exception: SQLException doing query (S1000): >>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>> FATAL 2015-03-19 18:32:05,870 (Thread-3838608) - >>>> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem >>>> 531146 >>>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Found a long-running >>>> query (64919 ms): [SELECT id,status,connectionname FROM jobs WHERE >>>> assessmentstate=? 
FOR UPDATE] >>>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Parameter 0: 'N' >>>> ERROR 2015-03-19 18:32:09,167 (Job reset thread) - Job reset thread >>>> aborting and restarting due to database connection reset: Database >>>> exception: SQLException doing query (S1000): java.lang.RuntimeException: >>>> Logging failed when attempting to log: >>>> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem >>>> 531146 java.lang.RuntimeException: Logging failed when attempting to log: >>>> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem >>>> 531146 >>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database >>>> exception: SQLException doing query (S1000): java.lang.RuntimeException: >>>> Logging failed when attempting to log: >>>> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem >>>> 531146 java.lang.RuntimeException: Logging failed when attempting to log: >>>> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem >>>> 531146 >>>> at >>>> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702) >>>> at >>>> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728) >>>> at >>>> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771) >>>> at >>>> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444) >>>> at >>>> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146) >>>> at >>>> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191) >>>> at >>>> org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:750) >>>> at >>>> org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performUpdate(DBInterfaceHSQLDB.java:296) >>>> at >>>> org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80) >>>> at >>>> org.apache.manifoldcf.crawler.jobs.JobQueue.noDocPriorities(JobQueue.java:967) >>>> at >>>> org.apache.manifoldcf.crawler.jobs.JobManager.noDocPriorities(JobManager.java:8148) >>>> at >>>> org.apache.manifoldcf.crawler.jobs.JobManager.finishJobStops(JobManager.java:8123) >>>> at >>>> org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:69) >>>> Caused by: java.sql.SQLException: java.lang.RuntimeException: Logging >>>> failed when attempting to log: >>>> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem >>>> 531146 java.lang.RuntimeException: Logging failed when attempting to log: >>>> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem >>>> 531146 >>>> at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source) >>>> at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source) >>>> at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source) >>>> at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source) >>>> at >>>> org.apache.manifoldcf.core.database.Database.execute(Database.java:903) >>>> at >>>> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683) >>>> Caused by: org.hsqldb.HsqlException: java.lang.RuntimeException: >>>> Logging failed when attempting to log: >>>> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem >>>> 531146 >>>> at org.hsqldb.error.Error.error(Unknown Source) >>>> at org.hsqldb.result.Result.newErrorResult(Unknown Source) >>>> at org.hsqldb.StatementDMQL.execute(Unknown Source) >>>> at 
org.hsqldb.Session.executeCompiledStatement(Unknown Source) >>>> at org.hsqldb.Session.execute(Unknown Source) >>>> ... 4 more >>>> Caused by: java.lang.RuntimeException: Logging failed when attempting >>>> to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out >>>> of mem 531146 >>>> at org.hsqldb.lib.FrameworkLogger.privlog(Unknown Source) >>>> at org.hsqldb.lib.FrameworkLogger.severe(Unknown Source) >>>> at org.hsqldb.persist.Logger.logSevereEvent(Unknown Source) >>>> at org.hsqldb.persist.DataFileCache.logSevereEvent(Unknown Source) >>>> at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source) >>>> at org.hsqldb.persist.DataFileCache.get(Unknown Source) >>>> at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source) >>>> at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source) >>>> at org.hsqldb.index.NodeAVLDisk.getRight(Unknown Source) >>>> at org.hsqldb.index.IndexAVL.next(Unknown Source) >>>> at org.hsqldb.index.IndexAVL.next(Unknown Source) >>>> at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source) >>>> at org.hsqldb.RangeVariable$RangeIteratorMain.findNext(Unknown Source) >>>> at org.hsqldb.RangeVariable$RangeIteratorMain.next(Unknown Source) >>>> at org.hsqldb.StatementDML.executeUpdateStatement(Unknown Source) >>>> at org.hsqldb.StatementDML.getResult(Unknown Source) >>>> ... 7 more >>>> Caused by: java.lang.reflect.InvocationTargetException >>>> at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) >>>> at >>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>> at java.lang.reflect.Method.invoke(Method.java:483) >>>> ... 23 more >>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded >>>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Plan: >>>> isDistinctSelect=[false] >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: >>>> isGrouped=[false] >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: >>>> isAggregated=[false] >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: columns=[ >>>> COLUMN: PUBLIC.JOBS.ID not nullable >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: >>>> PUBLIC.JOBS.STATUS not nullable >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: >>>> PUBLIC.JOBS.CONNECTIONNAME not nullable >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ] >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: [range >>>> variable 1 >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join type=INNER >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: table=JOBS >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: cardinality=5 >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: access=FULL >>>> SCAN >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join condition >>>> = [index=SYS_IDX_SYS_PK_10234_10237 >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: other >>>> condition=[ >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: EQUAL >>>> arg_left=[ COLUMN: PUBLIC.JOBS.ASSESSMENTSTATE >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ] arg_right=[ >>>> DYNAMIC PARAM: , TYPE = CHARACTER >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]] >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ] >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]] >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: 
PARAMETERS=[ >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: @0[DYNAMIC >>>> PARAM: , TYPE = CHARACTER >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]] >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: SUBQUERIES[] >>>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - >>>> FATAL 2015-03-19 18:32:09,198 (Job notification thread) - >>>> JobNotificationThread initialization error tossed: GC overhead limit >>>> exceeded >>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>> FATAL 2015-03-19 18:32:09,198 (Set priority thread) - SetPriorityThread >>>> initialization error tossed: GC overhead limit exceeded >>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>> FATAL 2015-03-19 18:32:09,198 (Finisher thread) - FinisherThread >>>> initialization error tossed: GC overhead limit exceeded >>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>> FATAL 2015-03-19 18:32:09,198 (Job delete thread) - JobDeleteThread >>>> initialization error tossed: GC overhead limit exceeded >>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>> FATAL 2015-03-19 18:32:09,198 (Seeding thread) - SeedingThread >>>> initialization error tossed: GC overhead limit exceeded >>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>> >>>> >>> Karl Wright <[email protected]> 3/19/2015 3:34 PM >>> >>>> Hi Ian, >>>> >>>> ManifoldCF operates under what is known as a "bounded" memory model. >>>> That means that you should always be able to find a memory size that works >>>> (that isn't huge). >>>> >>>> The only exception to this is for Solr indexing that does *not* go via >>>> the extracting update handler. The standard update handler unfortunately >>>> *requires* that the entire document fit in memory. If this is what you are >>>> doing, you must take steps to limit the maximum document size to prevent >>>> OOM's. >>>> >>>> 160,000 documents is quite small by MCF standards (we do 10 million to >>>> 50 million on some setups). So let's diagnose your problem before taking >>>> any bizarre actions. Can you provide an out-of-memory dump from the log, >>>> for instance? Can you let us know what deployment model you are using (e.g. >>>> single-process, etc.)? >>>> >>>> Thanks, >>>> Karl >>>> >>>> >>>> On Thu, Mar 19, 2015 at 3:07 PM, Ian Zapczynski < >>>> [email protected]> wrote: >>>> >>>>> Hello all. I am using ManifoldCF to index a Windows share containing >>>>> well over 160,000 files (.xls, .pdf, .doc). I keep getting memory errors >>>>> when I try to index the whole folder at once and have not been able to >>>>> resolve this by throwing memory and CPU at Tomcat and the VM, so I thought >>>>> I'd try this a different way. >>>>> What I'd like to do now is break what was a single job up into >>>>> multiple jobs. Each job should index all indexable files under a parent >>>>> folder, with one job indexing folders whose names begin with the letters >>>>> A-G as well as all subfolders and files within, another job for H-M also >>>>> with all subfolders/files, and so on. My problem is, somehow I can't >>>>> manage >>>>> to figure out what expression to use to get it to index what I want. >>>>> In the Job settings under Paths, I have specified the parent folder, >>>>> and within there I've tried: >>>>> 1. Include file(s) or directory(s) matching * (this works, but >>>>> indexes every file in every folder within the parent, eventually causing >>>>> me >>>>> unresolvable GC memory overhead errors) >>>>> 2. 
Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not work; it supposedly indexes one file and then quits)
>>>>> 3. Include file(s) or directory(s) matching A* (this does not work; it supposedly indexes one file and then quits, and there are many folders directly under the parent that begin with 'A')
>>>>> Can anyone help confirm what type of expression I should use in the paths to accomplish what I want?
>>>>> Or alternatively, if you think I should be able to index 160,000+ files in one job without getting GC memory overhead errors, I'm open to hearing your suggestions on resolving those. All I know to do is increase the maximum memory in Tomcat as well as on the OS, and that didn't help at all.
>>>>> Thanks much!
>>>>> -Ian
>>>>> >>>> >>> >> >
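One note on attempts 2 and 3, setting aside how the connector actually interprets the "Include file(s) or directory(s) matching" field (simple * wildcards versus full regular expressions is worth confirming in the connector documentation): even read as a regex, ^(?i)[A-G]* does not mean "starts with A through G". [A-G]* matches zero or more characters, each drawn from A-G, so as a full match it only accepts names built entirely from those letters; a name like "Acme Corp" fails at the 'm'. A starts-with test needs [A-G].* instead. A small, standalone Java sketch of the difference, using made-up folder names:

    import java.util.regex.Pattern;

    public class FolderPrefixCheck {
        public static void main(String[] args) {
            // The attempted pattern: zero or more characters, each in A-G.
            // As a full-string match it accepts only names built entirely
            // from those letters (or the empty string).
            Pattern attempted = Pattern.compile("^(?i)[A-G]*$");

            // "Starts with A-G (any case), then anything" is [A-G] followed by .*
            Pattern startsWithAtoG = Pattern.compile("^(?i)[A-G].*$");
            Pattern startsWithRtoZ = Pattern.compile("^(?i)[R-Z].*$");

            // Hypothetical folder names, purely for illustration.
            String[] names = { "Acme Corp", "Zenith Partners", "123 Holdings" };
            for (String name : names) {
                System.out.printf("%-16s attempted=%-5b A-G=%-5b R-Z=%b%n",
                        name,
                        attempted.matcher(name).matches(),
                        startsWithAtoG.matcher(name).matches(),
                        startsWithRtoZ.matcher(name).matches());
            }
        }
    }

Whether the connector applies the expression to the folder name alone or to the full path is another connector-specific detail that would change which pattern is needed.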
