Hello :-)
When I run Tika on 400 files via Xapian-Omega's omindex binary, Tika
gives the error in the subject 247 times. The files triggering the error
have extensions doc, pdf, ppt, rtf and xls, so the problem is probably
not specific to the file type.
Running vmstat with a 1-second delay during the omindex run shows no
swapping and consistently ~0.5 GB (of 1 GB) free memory, so the problem
does not appear to be a shortage of physical memory.
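vmstat's free column does not show how much swap is configured or how
much memory is already committed, and either might be relevant to an
"Out of swap space?" message. A minimal sketch for capturing those
figures alongside vmstat, assuming Linux's /proc/meminfo (the function
name mem_summary is made up for illustration):

```shell
#!/bin/bash
# Sketch: print the swap and commit figures from /proc/meminfo.
# Assumes Linux; on other systems it only reports that the file is missing.
mem_summary() {
    if [ ! -r /proc/meminfo ]; then
        echo "no /proc/meminfo on this system"
        return 0
    fi
    grep -E '^(MemFree|SwapTotal|SwapFree|CommitLimit|Committed_AS):' /proc/meminfo ||
        echo "expected fields not found in /proc/meminfo"
}

mem_summary
```

Running it before and during the omindex run would show whether the
committed figure approaches the commit limit.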
The bash ulimit command reported "unlimited" and
/etc/security/limits.conf is all comments or empty lines.
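Note that bash's ulimit with no option reports only the file-size limit
(-f), so "unlimited" above says nothing about the other limits. A
minimal sketch for listing the ones that seem most likely to matter to a
JVM, assuming bash; which of them (if any) is involved here is a guess,
and the function name show_limits is made up for illustration:

```shell
#!/bin/bash
# Sketch: print several per-process soft limits individually.
# Plain `ulimit` is equivalent to `ulimit -f` (file size) only.
show_limits() {
    echo "virtual memory (KB): $(ulimit -v)"
    echo "max user processes:  $(ulimit -u)"
    echo "open files:          $(ulimit -n)"
    echo "stack size (KB):     $(ulimit -s)"
}

show_limits
```

`ulimit -a` prints all limits in one go. The values should be read in
the same shell (and as the same user) that runs omindex.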
Netsearching widely (most informative pages listed below sig) suggested
adjusting Java memory spaces but neither export JAVA_OPTS='-Xms256m
-Xmx512m' nor export JAVA_OPTS='-Xmx512m' before running omindex fixed
the problem. I do not know what the defaults are.
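One way to see the defaults might be the sketch below. It assumes a
HotSpot java on the PATH that supports -XX:+PrintFlagsFinal (older JVMs
may not), and the function name heap_defaults is made up for
illustration. It also sets JAVA_TOOL_OPTIONS rather than JAVA_OPTS:
JAVA_OPTS is only a convention honoured by some launcher scripts,
whereas JAVA_TOOL_OPTIONS is read by the JVM itself. I have not verified
whether the command omindex uses to start Tika passes either one
through.

```shell
#!/bin/bash
# Sketch: report the heap and thread-stack sizes the JVM would use.
# Assumes a HotSpot JVM; prints a note instead if java is unavailable.
heap_defaults() {
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found on PATH"
        return 0
    fi
    java -XX:+PrintFlagsFinal -version 2>/dev/null |
        grep -E 'InitialHeapSize|MaxHeapSize|ThreadStackSize' ||
        echo "flags not reported by this JVM"
}

# Read by the JVM itself, unlike JAVA_OPTS.
export JAVA_TOOL_OPTIONS='-Xmx512m'
heap_defaults
```

Unsetting JAVA_TOOL_OPTIONS and calling heap_defaults again shows the
values the JVM picks on its own. When the variable is set, the JVM
announces it with a "Picked up JAVA_TOOL_OPTIONS" line on stderr, which
gives a way to tell whether the setting reached the Tika process.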
Tika worked on this development system from installation on 31mar11
until it was last used on 14apr11. All system changes are logged but
none of the changes since 14apr11 are obviously relevant. Tomcat 6 was
installed for GeoServer and this did take ~1 GB virtual memory, perhaps
triggering the problem, but it and MySQL have since been disabled in the
boot scripts and the system rebooted. Tika is still working on the live
system which is similar to the development system in terms of installed
software and versions.
I wanted to try Tika 0.9, but its bmp, jpeg and png parsing tests
failed during the Maven build. I do not know enough Java/Maven to see
whether those errors are related.
The OS is 64-bit Debian Squeeze, running headless in a virtual machine
-- hence the small sample of 400 files and the 1 GB memory.
What can I do for more analysis and, hopefully, a fix?
Best
Charles
Good pages re "OutOfMemoryError/Out of swap space?":
* JVM Lies: The OutOfMemory Myth:
http://www.codingthearchitecture.com/2008/01/14/jvm_lies_the_outofmemory_myth.html
* Troubleshooting Guide for Java SE 6 with HotSpot VM:
http://www.oracle.com/technetwork/java/javase/memleaks-137499.html#gbyvj