" failure processing document Server athttp://localhost:8983/solr returned non ok status:500, message:Server Error""
That's an error occurring on Solr. What do the Solr logs say?

Karl

On Tue, Mar 24, 2015 at 11:11 AM, Ian Zapczynski <[email protected]> wrote:

> Unfortunately I'm still getting stuck indexing. It at least appears to me that I have such a large number of password-protected docs and scanned PDFs without OCR that the job is dying on me before it even finds all the "good" docs. It will die with the error "Error: Repeated service interruptions - failure processing document Server at http://localhost:8983/solr returned non ok status:500, message:Server Error". My job tells me there are 168,595 documents, with 73,238 currently active and 106,291 processed. At this point, if I keep restarting the job, it slowly adds a small number of new docs, but then dies again with the same error. The thing is, it has not indexed a large number of documents that should be indexable.
>
> To help clarify, I am indexing a folder that contains thousands of folders named after companies we have associated with, each holding several files and folders of its own. If I test searches in SOLR by reviewing a document and then performing a search based on that text, I consistently get the expected results for files within folders whose company names begin with A, B, C, D, etc. However, I do not get results for files within folders whose company names begin with R, S, T, U, V, etc.
>
> We are looking into whether we should batch-convert the scanned PDFs to support OCR and thereby cut down on the number of problem docs, but for now I'd like to just get all of the indexable documents into SOLR.
>
> Going back to my original question, should I consider breaking this single job into multiple jobs based on the letter of the alphabet? If so, I haven't been able to figure out a working regex to tell it to pick up all files and folders within a folder whose name begins with R-Z, for example. And if not that workaround, where do you suggest I go to resolve this? I'm not entirely sure what is causing all of the files and folders not to be traversed before my job dies (is this a ManifoldCF thing or a SOLR thing?).
>
> Thanks again for your help.
>
> >>> Ian Zapczynski 3/20/2015 2:25 PM >>>
> Thanks for the help, Karl. Yup, I was using the simple-to-set-up single-process configuration, and silly me... after I restarted from scratch at one point, I completely failed to update the combined-options-env.win config file that you referred to, so MCF was still set to use only 256 MB despite my thinking otherwise. I've bumped it up to 4 GB, and the job recovered and is finally moving along again.
> -Ian
>
> >>> Karl Wright <[email protected]> 3/20/2015 10:55 AM >>>
> Hi Ian,
>
> HSQLDB is an interesting database in that it is *not* memory constrained: it attempts to keep everything in memory.
>
> I'd strongly suggest giving the MCF agents process a lot more memory, say 2 GB, if you want to keep using HSQLDB. A better choice would be PostgreSQL or MySQL. There's a configuration file where you can put Java switches for all of the processes; start by doing that.
>
> Thanks,
> Karl
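The combined-options-env.win file referred to above is the configuration file Karl means for Java switches. If it follows the usual one-switch-per-line layout of the single-process example's options files (the exact contents vary by ManifoldCF version, so treat these values as illustrative), the heap bump Ian describes amounts to edits along these lines:

    -Xms1024m
    -Xmx4096m

Because HSQLDB tries to hold its entire working set on the heap, that -Xmx value is effectively the database's memory budget as well as the crawler's, which is why PostgreSQL or MySQL is the recommendation for larger crawls.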
> On Fri, Mar 20, 2015 at 9:29 AM, Ian Zapczynski <[email protected]> wrote:
>
>> Hi Karl,
>>
>> I have SOLR and ManifoldCF running with Tomcat on a Windows 2012 R2 server. Linux would have been my preference, but various logistics prevented me from using that.
>>
>> I have set the maximum document length to be 3072000. I chose a larger size than what might be normal because when I first did a test, I could see that a lot of docs were getting rejected based on size, and it seems folks around here don't reduce/shrink the size of their PDFs.
>>
>> The errors from the log are below. I was more busy paying attention to the errors spit out to the console, which didn't so obviously point to the backend database being the culprit. I'm guessing that I'm pushing the database too hard and should really be using PostgreSQL, right? I don't know why, but I didn't see or reference the deployment documentation that covered using various other databases until now. I was working off of the ManifoldCF End User Documentation as well as a (mostly) helpful blog post I found elsewhere.
>>
>> Much thanks,
>> -Ian
>>
>> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file1.pdf (500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>> org.apache.solr.common.SolrException: Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>>   at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:303)
>>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>>   at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>>   at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:894)
>> WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Service interruption reported for job 1426796577848 connection 'MACLSTR file server': Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file3.pdf (500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
>> ERROR 2015-03-19 18:31:45,730 (Job delete thread) - Job delete thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>> ERROR 2015-03-19 18:31:45,309 (Finisher thread) - Finisher thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>> ERROR 2015-03-19 18:31:43,043 (Set priority thread) - Set priority thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>> ERROR 2015-03-19 18:32:02,292 (Job notification thread) - Job notification thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
>> FATAL 2015-03-19 18:32:05,870 (Thread-3838608) - C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Found a long-running query (64919 ms): [SELECT id,status,connectionname FROM jobs WHERE assessmentstate=? FOR UPDATE]
>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Parameter 0: 'N'
>> ERROR 2015-03-19 18:32:09,167 (Job reset thread) - Job reset thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>   at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
>>   at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>>   at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
>>   at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>>   at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>>   at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>>   at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:750)
>>   at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performUpdate(DBInterfaceHSQLDB.java:296)
>>   at org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
>>   at org.apache.manifoldcf.crawler.jobs.JobQueue.noDocPriorities(JobQueue.java:967)
>>   at org.apache.manifoldcf.crawler.jobs.JobManager.noDocPriorities(JobManager.java:8148)
>>   at org.apache.manifoldcf.crawler.jobs.JobManager.finishJobStops(JobManager.java:8123)
>>   at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:69)
>> Caused by: java.sql.SQLException: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>   at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>>   at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>>   at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
>>   at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)
>>   at org.apache.manifoldcf.core.database.Database.execute(Database.java:903)
>>   at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683)
>> Caused by: org.hsqldb.HsqlException: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>   at org.hsqldb.error.Error.error(Unknown Source)
>>   at org.hsqldb.result.Result.newErrorResult(Unknown Source)
>>   at org.hsqldb.StatementDMQL.execute(Unknown Source)
>>   at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
>>   at org.hsqldb.Session.execute(Unknown Source)
>>   ... 4 more
>> Caused by: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
>>   at org.hsqldb.lib.FrameworkLogger.privlog(Unknown Source)
>>   at org.hsqldb.lib.FrameworkLogger.severe(Unknown Source)
>>   at org.hsqldb.persist.Logger.logSevereEvent(Unknown Source)
>>   at org.hsqldb.persist.DataFileCache.logSevereEvent(Unknown Source)
>>   at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source)
>>   at org.hsqldb.persist.DataFileCache.get(Unknown Source)
>>   at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
>>   at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source)
>>   at org.hsqldb.index.NodeAVLDisk.getRight(Unknown Source)
>>   at org.hsqldb.index.IndexAVL.next(Unknown Source)
>>   at org.hsqldb.index.IndexAVL.next(Unknown Source)
>>   at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source)
>>   at org.hsqldb.RangeVariable$RangeIteratorMain.findNext(Unknown Source)
>>   at org.hsqldb.RangeVariable$RangeIteratorMain.next(Unknown Source)
>>   at org.hsqldb.StatementDML.executeUpdateStatement(Unknown Source)
>>   at org.hsqldb.StatementDML.getResult(Unknown Source)
>>   ... 7 more
>> Caused by: java.lang.reflect.InvocationTargetException
>>   at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:483)
>>   ... 23 more
>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>> WARN 2015-03-19 18:32:09,167 (Assessment thread) - Plan: isDistinctSelect=[false]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isGrouped=[false]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isAggregated=[false]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: columns=[ COLUMN: PUBLIC.JOBS.ID not nullable
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.STATUS not nullable
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.CONNECTIONNAME not nullable
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan:
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: [range variable 1
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join type=INNER
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: table=JOBS
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: cardinality=5
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: access=FULL SCAN
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join condition = [index=SYS_IDX_SYS_PK_10234_10237
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: other condition=[
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: EQUAL arg_left=[ COLUMN: PUBLIC.JOBS.ASSESSMENTSTATE
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ] arg_right=[ DYNAMIC PARAM: , TYPE = CHARACTER
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: PARAMETERS=[
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: @0[DYNAMIC PARAM: , TYPE = CHARACTER
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: SUBQUERIES[]
>> WARN 2015-03-19 18:32:09,182 (Assessment thread) -
>> FATAL 2015-03-19 18:32:09,198 (Job notification thread) - JobNotificationThread initialization error tossed: GC overhead limit exceeded
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> FATAL 2015-03-19 18:32:09,198 (Set priority thread) - SetPriorityThread initialization error tossed: GC overhead limit exceeded
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> FATAL 2015-03-19 18:32:09,198 (Finisher thread) - FinisherThread initialization error tossed: GC overhead limit exceeded
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> FATAL 2015-03-19 18:32:09,198 (Job delete thread) - JobDeleteThread initialization error tossed: GC overhead limit exceeded
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> FATAL 2015-03-19 18:32:09,198 (Seeding thread) - SeedingThread initialization error tossed: GC overhead limit exceeded
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> >>> Karl Wright <[email protected]> 3/19/2015 3:34 PM >>>
>> Hi Ian,
>>
>> ManifoldCF operates under what is known as a "bounded" memory model. That means that you should always be able to find a memory size that works (one that isn't huge).
>>
>> The only exception to this is Solr indexing that does *not* go via the extracting update handler. The standard update handler unfortunately *requires* that the entire document fit in memory. If this is what you are doing, you must take steps to limit the maximum document size to prevent OOMs.
>>
>> 160,000 documents is quite small by MCF standards (we do 10 million to 50 million on some setups). So let's diagnose your problem before taking any bizarre actions. Can you provide an out-of-memory dump from the log, for instance? Can you let us know what deployment model you are using (e.g. single-process, etc.)?
>>
>> Thanks,
>> Karl
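On the extracting update handler Karl mentions: that is Solr's /update/extract endpoint (Solr Cell), where Tika runs inside Solr and the raw document is streamed rather than built up in the crawler's heap. To check outside of ManifoldCF whether Solr copes with one of the problem PDFs, a rough SolrJ 4.x-style sketch like the one below can be used. It assumes the ExtractingRequestHandler is registered at /update/extract in solrconfig.xml and that the schema has an "id" field and an "attr_*" dynamic field; the file path and literal.id value are placeholders.

    import java.io.File;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractOneFile {
        public static void main(String[] args) throws Exception {
            // Base URL as it appears in the MCF log messages; append the core
            // name (e.g. /collection1) if your install requires it.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

            // Stream one suspect PDF to Solr Cell; Tika extraction happens on the Solr side.
            // The local path and the literal.id value below are placeholders.
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("C:\\temp\\suspect.pdf"), "application/pdf");
            req.setParam("literal.id", "test-suspect-pdf");
            // Send unknown Tika metadata fields somewhere harmless; adjust to your schema.
            req.setParam("uprefix", "attr_");
            req.setParam("fmap.content", "attr_content");
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

            System.out.println(server.request(req));
            server.shutdown();
        }
    }

If this request also fails with a 500, the Solr log (which is what Karl asked about) should show the underlying extraction exception for that particular file; password-protected PDFs commonly fail at this step.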
>> On Thu, Mar 19, 2015 at 3:07 PM, Ian Zapczynski <[email protected]> wrote:
>>
>>> Hello all. I am using ManifoldCF to index a Windows share containing well over 160,000 files (.xls, .pdf, .doc). I keep getting memory errors when I try to index the whole folder at once and have not been able to resolve this by throwing memory and CPU at Tomcat and the VM, so I thought I'd try this a different way.
>>>
>>> What I'd like to do now is break what was a single job up into multiple jobs. Each job should index all indexable files under a parent folder, with one job indexing folders whose names begin with the letters A-G as well as all subfolders and files within, another job for H-M also with all subfolders/files, and so on. My problem is, somehow I can't manage to figure out what expression to use to get it to index what I want.
>>>
>>> In the Job settings under Paths, I have specified the parent folder, and within there I've tried:
>>> 1. Include file(s) or directory(s) matching * (this works, but indexes every file in every folder within the parent, eventually causing me unresolvable GC memory overhead errors)
>>> 2. Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not work; it supposedly indexes one file and then quits)
>>> 3. Include file(s) or directory(s) matching A* (this does not work; it supposedly indexes one file and then quits, and there are many folders directly under the parent that begin with 'A')
>>>
>>> Can anyone help confirm what type of expression I should use in the paths to accomplish what I want? Or alternately, if you think I should be able to index 160,000+ files in one job without getting GC memory overhead errors, I'm open to hearing your suggestions on resolving those. All I know to do is increase the maximum memory in Tomcat as well as on the OS, and that didn't help at all.
>>>
>>> Thanks much!
>>> -Ian
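On the A-G / H-M style split: whatever syntax the connector's "Include file(s) or directory(s) matching" field actually expects (the connector documentation is the authority here; if it treats the value as a simple * / ? wildcard rather than a Java regular expression, that would explain why ^(?i)[A-G]* misbehaved), it is worth first checking which top-level folders each candidate range would cover. A small standalone sketch follows, using the share root that appears in the log messages above; the letter ranges are just examples.

    import java.io.File;
    import java.util.regex.Pattern;

    public class FolderRangeCheck {
        public static void main(String[] args) {
            // Share root as it appears in the MCF log messages; adjust as needed.
            File root = new File("\\\\host.domain.com\\FileShare1\\Data\\Manager Information");

            // Candidate per-job ranges, matched case-insensitively on the first character.
            Pattern[] ranges = {
                Pattern.compile("(?i)^[A-G].*"),
                Pattern.compile("(?i)^[H-M].*"),
                Pattern.compile("(?i)^[N-Z].*"),
            };
            int[] counts = new int[ranges.length];
            int unmatched = 0;  // folders starting with digits, punctuation, etc.

            File[] children = root.listFiles();
            if (children == null) {
                System.err.println("Cannot list " + root);
                return;
            }
            for (File child : children) {
                if (!child.isDirectory()) {
                    continue;
                }
                boolean hit = false;
                for (int i = 0; i < ranges.length; i++) {
                    if (ranges[i].matcher(child.getName()).matches()) {
                        counts[i]++;
                        hit = true;
                        break;
                    }
                }
                if (!hit) {
                    unmatched++;
                }
            }
            for (int i = 0; i < ranges.length; i++) {
                System.out.println(ranges[i].pattern() + " -> " + counts[i] + " folders");
            }
            System.out.println("unmatched -> " + unmatched + " folders");
        }
    }

Folders whose names start with digits or punctuation land in the "unmatched" bucket and are easy to lose when splitting by letter, so they would need a catch-all job of their own.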

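Related to the password-protected and scanned-PDF problem mentioned near the top of the thread: before deciding whether to batch-OCR, it can help to count how many PDFs on the share are encrypted or carry no extractable text at all. A rough triage sketch, assuming Apache PDFBox (1.8-style API) on the classpath; this is only an approximation and not how ManifoldCF or Solr Cell themselves classify documents.

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PdfTriage {
        public static void main(String[] args) {
            // Walk the share (or a local copy); the path is a placeholder.
            int[] totals = walk(new File("\\\\host.domain.com\\FileShare1\\Data\\Manager Information"), new int[4]);
            System.out.printf("encrypted=%d, noText=%d, hasText=%d, unreadable=%d%n",
                    totals[0], totals[1], totals[2], totals[3]);
        }

        private static int[] walk(File dir, int[] totals) {
            File[] children = dir.listFiles();
            if (children == null) return totals;
            for (File f : children) {
                if (f.isDirectory()) {
                    walk(f, totals);
                } else if (f.getName().toLowerCase().endsWith(".pdf")) {
                    classify(f, totals);
                }
            }
            return totals;
        }

        private static void classify(File f, int[] totals) {
            PDDocument doc = null;
            try {
                doc = PDDocument.load(f);
                if (doc.isEncrypted()) {
                    totals[0]++;   // password-protected
                } else if (new PDFTextStripper().getText(doc).trim().isEmpty()) {
                    totals[1]++;   // likely a scan with no OCR text layer
                } else {
                    totals[2]++;   // has extractable text
                }
            } catch (IOException e) {
                totals[3]++;       // corrupt or otherwise unreadable
            } finally {
                if (doc != null) {
                    try { doc.close(); } catch (IOException ignored) {}
                }
            }
        }
    }

If most of the failing documents turn out to be encrypted or text-free scans, excluding or converting them should cut the error volume considerably, independently of the memory and database questions.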