Hello all. I am using ManifoldCF to index a Windows share containing well
over 160,000 files (.xls, .pdf, .doc). I keep getting memory errors when I
try to index the whole folder at once, and giving Tomcat and the VM more
memory and CPU has not resolved them, so I thought I'd try a different
approach.
What I'd like to do now is break what was a single job up into multiple jobs.
Each job should index all indexable files under a parent folder, split by
folder name: one job would index the folders whose names begin with the
letters A-G, including all subfolders and files within them, another job
would do the same for H-M, and so on. My problem is that I can't figure out
what expression to use to get each job to index what I want.
In the Job settings under Paths, I have specified the parent folder, and within
there I've tried:
1. Include file(s) or directory(s) matching * (this works, but indexes every
file in every folder within the parent, eventually causing me unresolvable GC
memory overhead errors)
2. Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not work;
the job reports indexing one file and then stops)
3. Include file(s) or directory(s) matching A* (this does not work either; the
job again reports indexing one file and then stops, even though there are many
folders directly under the parent that begin with 'A')
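
To make the question concrete, here is a minimal sketch of how those three
expressions behave under two possible interpretations: as a shell-style
wildcard and as a regular expression. It uses Python's fnmatch and re modules
purely as stand-ins; I am not assuming ManifoldCF's matcher behaves like
either one, and the folder names are made up for illustration. Python requires
the inline flag to come first, so ^(?i)[A-G]* is written as (?i)^[A-G]* in the
regex case.

```python
import fnmatch
import re

# Hypothetical folder names directly under the parent, for illustration only.
FOLDERS = ["Accounting", "budget", "Harvest", "Marketing", "Zebra"]


def as_wildcard(pattern, names):
    """Names matched when the pattern is read as a shell-style wildcard."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def as_regex(pattern, names):
    """Names matched when the pattern is read as a regex anchored at the start.

    Returns None if the pattern is not a valid regular expression.
    """
    try:
        compiled = re.compile(pattern)
    except re.error:
        return None
    return [n for n in names if compiled.match(n)]


if __name__ == "__main__":
    # The three expressions from the list above, plus a regex variant
    # that requires at least one leading character in the A-G range.
    for pattern in ["*", "^(?i)[A-G]*", "A*"]:
        print("wildcard", repr(pattern), as_wildcard(pattern, FOLDERS))
    for pattern in ["*", "(?i)^[A-G]*", "A*", "(?i)^[A-G].*"]:
        print("regex   ", repr(pattern), as_regex(pattern, FOLDERS))
```

In this sketch, a bare * is not a valid regex at all, and as a regex both
[A-G]* and A* match every name, because * there means "zero or more of the
preceding item". Read as a wildcard, A* matches only names starting with a
capital A, and the ^(?i)[A-G]* string matches nothing.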
Can anyone confirm what type of expression I should use in the paths to
accomplish what I want?
Alternatively, if you think I should be able to index 160,000+ files in one
job without getting GC memory overhead errors, I'm open to hearing your
suggestions on resolving those. All I know to do is increase the maximum
memory in Tomcat as well as on the OS, and that didn't help at all.
Thanks much!
-Ian