On 16/06/11 16:56, Jukka Zitting wrote:
Hi,
On Thu, Jun 16, 2011 at 8:55 AM, Charles<[email protected]> wrote:
> The problem was fixed by increasing the VM memory from 1 GB to 3 GB
> (intermediate sizes not explored, JAVA_OPTS fix attempts backed out) so it
> seems it really was a memory shortage despite top's and vmstat's
> re-assurance. I wonder what triggered it.
It sounds unlikely for Tika to be using that much memory unless you're
processing some huge documents.
To better investigate the issue you could start your JVM with the
-XX:+HeapDumpOnOutOfMemoryError option, and inspect the heap dump that
gets created when an OOM error is encountered.
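A minimal sketch of that invocation, assuming a HotSpot JVM. The TIKA_JAR and DUMP_DIR values and the tika_with_heap_dump helper are placeholders for illustration, not part of Tika:

```shell
# Placeholder locations; adjust both for your own setup.
TIKA_JAR="${TIKA_JAR:-tika-app-0.9.jar}"
DUMP_DIR="${DUMP_DIR:-/tmp/tika-dumps}"

# Run Tika on one document with a small heap and ask the JVM to write
# a heap dump into $DUMP_DIR if it hits an OutOfMemoryError.
tika_with_heap_dump() {
    mkdir -p "$DUMP_DIR"
    java -Xmx100m \
         -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath="$DUMP_DIR" \
         -jar "$TIKA_JAR" "$1" > /dev/null
}

# Example: tika_with_heap_dump /path/to/documents/sample.pdf
```

If an OOM error occurs, the dump should appear in that directory as a java_pid<pid>.hprof file, which a heap analyser can open.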
Alternatively, you can try identifying the troublesome document by
running a script like the following:
for file in /path/to/documents/*; do
    echo "$file"
    java -Xmx100m -jar tika-app-0.9.jar "$file" > /dev/null
done
The ForkParser feature introduced in Tika 0.9 can be used to run text
extraction in a background process so that a possible OOM error or
even a JVM crash won't affect your application.
BR,
Jukka Zitting
Thanks Jukka :-)
The biggest file is ~95 MB; most are under 1 MB.
I set the VirtualBox VM's memory back to 1024 MB and tried to reproduce
the problem, but the behaviour has changed. These are the results on the
collection of documents used for development and testing, for the types
that omindex converts to text with Tika 0.8:
doc files:  tried: 134, failed: 60   44.77%
docx files: tried:   1, failed:  0
odp files:  tried:   1, failed:  0
ods files:  tried:  23, failed:  0
odt files:  tried:  71, failed:  0
pdf files:  tried:  81, failed: 81  100.00%
ppt files:  tried:   4, failed:  4  100.00%
rtf files:  tried:   2, failed:  2  100.00%
xls files:  tried:  27, failed: 27  100.00%
The java.lang.OutOfMemoryError message no longer appears. Some files now
generate std::bad_alloc messages but most simply report "Aborted" (IDK
whether that message is from omindex or Tika).
omindex monitors Tika's behaviour (return code?). When it detects a
failure it logs the failing Tika command. Running a sample of those
commands at the command prompt does not produce the errors. It is
beginning to look as if the problem is caused by the environment that
omindex sets up for Tika to run in. There is a several-year-old omindex
bug report (link not to hand) about omindex restricting memory for its
filter programs.
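One way to test that hypothesis from the command prompt is to re-run a logged command under an explicit memory limit. This is a rough sketch, assuming the restriction is a virtual-memory cap of the kind `ulimit -v` sets. The run_limited helper and the 500000 KiB value are placeholders, not what omindex actually applies:

```shell
# Assumed limit in KiB; the value omindex really uses is not known here.
LIMIT_KB="${LIMIT_KB:-500000}"

# Run a command with its virtual memory capped at $LIMIT_KB.
# The subshell keeps the limit from sticking to the calling shell.
run_limited() {
    ( ulimit -v "$LIMIT_KB" && "$@" )
}

# Example: run_limited java -jar tika-app-0.8.jar /path/to/sample.pdf > /dev/null
```

If the logged commands fail under such a cap but succeed without it, that would point at the memory restriction rather than at the documents.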
I'll take this to the Xapian+Omega mailing list (omindex is part of Omega).
Best
Charles