Hi Ian,

ManifoldCF operates under what is known as a "bounded" memory model: you should always be able to find a fixed memory size that works, and it doesn't need to be a huge one.
The only exception to this is Solr indexing that does *not* go via the extracting update handler. The standard update handler unfortunately *requires* that the entire document fit in memory. If that is what you are doing, you must take steps to limit the maximum document size to prevent OOMs.

160,000 documents is quite small by MCF standards (we do 10 million to 50 million on some setups), so let's diagnose your problem before taking any drastic measures. Can you provide an out-of-memory dump from the log, for instance? Can you let us know what deployment model you are using (e.g. single-process, etc.)?

Thanks,
Karl

On Thu, Mar 19, 2015 at 3:07 PM, Ian Zapczynski <[email protected]> wrote:

> Hello all. I am using ManifoldCF to index a Windows share containing
> well over 160,000 files (.xls, .pdf, .doc). I keep getting memory errors
> when I try to index the whole folder at once and have not been able to
> resolve this by throwing memory and CPU at Tomcat and the VM, so I thought
> I'd try this a different way.
>
> What I'd like to do now is break what was a single job up into multiple
> jobs. Each job should index all indexable files under a parent folder,
> with one job indexing folders whose names begin with the letters A-G as
> well as all subfolders and files within, another job for H-M also with all
> subfolders/files, and so on. My problem is, somehow I can't manage to
> figure out what expression to use to get it to index what I want.
>
> In the Job settings under Paths, I have specified the parent folder, and
> within there I've tried:
>
> 1. Include file(s) or directory(s) matching * (this works, but indexes
> every file in every folder within the parent, eventually causing me
> unresolvable GC memory overhead errors)
> 2. Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not
> work; it supposedly indexes one file and then quits)
> 3. Include file(s) or directory(s) matching A* (this does not work; it
> supposedly indexes one file and then quits, and there are many folders
> directly under the parent that begin with 'A')
>
> Can anyone help confirm what type of expression I should use in the paths
> to accomplish what I want?
>
> Or alternately if you think I should be able to index 160,000+ files in
> one job without getting GC memory overhead errors, I'm open to hear your
> suggestions on resolving those. All I know to do is increase the maximum
> memory in Tomcat as well as on the OS, and that didn't help at all.
>
> Thanks much!
>
> -Ian
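
P.S. On the path patterns: I have not checked whether the Windows Share connector evaluates the "Include file(s) or directory(s) matching" field as a Java regular expression or as a simple * wildcard, so take the following only as a regex sanity check under that assumption. If it is a regex, note that ^(?i)[A-G]* means "zero or more characters from A-G", which is probably not what you intended; (?i)^[A-G].* is the usual way to express "starts with A through G". A minimal sketch for trying candidate patterns against sample names before putting them in the job spec (the folder names below are made up):

import java.util.List;
import java.util.regex.Pattern;

public class PathPatternCheck {
    public static void main(String[] args) {
        // Hypothetical stand-ins for the top-level folders on the share.
        List<String> folders = List.of("Accounts", "budget", "Graphs", "HR", "misc");

        // The pattern from attempt #2: "[A-G]*" means *zero or more* characters in A-G.
        // Depending on the match semantics, it either matches every name (via a
        // zero-length match at the start) or only names made up entirely of A-G letters.
        Pattern tried = Pattern.compile("^(?i)[A-G]*");

        // "Starts with a letter A through G, case-insensitively, followed by anything."
        Pattern intended = Pattern.compile("(?i)^[A-G].*");

        for (String name : folders) {
            System.out.printf("%-10s  tried(find)=%b  intended(matches)=%b%n",
                    name, tried.matcher(name).find(), intended.matcher(name).matches());
        }
    }
}

If the field turns out to take wildcards rather than regexes, another way to get the A-G / H-M split is to give each job its own explicit starting paths under the parent (assuming the connector lets you add more than one path per job), which sidesteps the pattern question entirely.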
