On 16/06/11 16:56, Jukka Zitting wrote:
Hi,

On Thu, Jun 16, 2011 at 8:55 AM, Charles <[email protected]> wrote:
>  The problem was fixed by increasing the VM memory from 1 GB to 3 GB
>  (intermediate sizes not explored, JAVA_OPTS fix attempts backed out) so it
>  seems it really was a memory shortage despite top's and vmstat's
>  re-assurance.  I wonder what triggered it.
It sounds unlikely for Tika to be using that much memory unless you're
processing some huge documents.

To better investigate the issue you could start your JVM with the
-XX:+HeapDumpOnOutOfMemoryError option, and inspect the heap dump that
gets created when an OOM error is encountered.

Alternatively, you can try identifying the troublesome document by
running a script like the following:

     for file in /path/to/documents/*; do
         echo "$file"
         java -Xmx100m -jar tika-app-0.9.jar "$file" > /dev/null
     done

The ForkParser feature introduced in Tika 0.9 can be used to run text
extraction in a background process so that a possible OOM error or
even a JVM crash won't affect your application.

BR,

Jukka Zitting
Thanks Jukka :-)

The biggest file is ~95 MB; most are under 1 MB.

I set the VirtualBox VM's memory back to 1024 MB and tried to reproduce the problem, but the behaviour has changed. These are the results on the collection of documents used for development and testing, for the file types that omindex converts to text with Tika 0.8:

doc files:  tried: 134, failed: 60  44.77%
docx files: tried:   1, failed:  0
odp files:  tried:   1, failed:  0
ods files:  tried:  23, failed:  0
odt files:  tried:  71, failed:  0
pdf files:  tried:  81, failed: 81 100.00%
ppt files:  tried:   4, failed:  4 100.00%
rtf files:  tried:   2, failed:  2 100.00%
xls files:  tried:  27, failed: 27 100.00%

The java.lang.OutOfMemoryError message no longer appears. Some conversions now generate std::bad_alloc messages, but most simply print "Aborted" (I don't know whether that message comes from omindex or from Tika).

omindex monitors Tika's behaviour (return code?) and, when it detects a failure, logs the failing Tika command. Running a sample of those commands at the command prompt does not produce the errors, so it is beginning to look as if the problem is caused by the environment that omindex sets up for Tika to run in. There is a several-year-old omindex bug report (link not to hand) about omindex restricting memory for its filter programs.
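
One way to test that hypothesis at the command prompt might be to run a logged Tika command under an artificial memory cap. The sketch below is only an illustration: the helper name run_limited and the limit values are made up, and it assumes that the restriction omindex applies behaves like a virtual-memory (address space) limit, which I have not confirmed. A JVM tends to reserve far more address space than it actually uses, so a cap of this kind could make it fail even while top and vmstat show free memory.

```shell
# Hypothetical helper (not part of omindex or Tika): run a command with a
# cap on virtual memory, given in kilobytes, without affecting the
# calling shell.
run_limited() {
    limit_kb=$1
    shift
    (
        # The limit applies only inside this subshell and its children.
        ulimit -v "$limit_kb" || exit 125
        "$@"
    )
}
```

For example, with one of the commands that omindex logged in place of the placeholder below:

```shell
run_limited 1000000 java -jar tika-app-0.8.jar /path/to/document > /dev/null
echo "exit status: $?"
```

If the command fails under the cap but succeeds without it, that would point towards the environment rather than the documents.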

I'll take this to the Xapian+Omega mailing list (omindex is part of Omega).

Best

Charles

