On 17/06/11 11:04, Charles wrote:
> On 16/06/11 16:56, Jukka Zitting wrote:
>> Hi,
>>
>> On Thu, Jun 16, 2011 at 8:55 AM, Charles <[email protected]> wrote:
>>> The problem was fixed by increasing the VM memory from 1 GB to 3 GB
>>> (intermediate sizes not explored, JAVA_OPTS fix attempts backed out) so it
>>> seems it really was a memory shortage despite top's and vmstat's
>>> re-assurance.  I wonder what triggered it.
>> It sounds unlikely for Tika to be using that much memory unless you're
>> processing some huge documents.
>>
>> To better investigate the issue you could start your JVM with the
>> -XX:+HeapDumpOnOutOfMemoryError option, and inspect the heap dump that
>> gets created when an OOM error is encountered.
>>
>> Alternatively, you can try identifying the troublesome document by
>> running a script like the following:
>>
>>      for file in /path/to/documents/*; do
>>          echo $file
>>          java -Xmx100m -jar tika-app-0.9.jar $file > /dev/null
>>      done
>>
>> The ForkParser feature introduced in Tika 0.9 can be used to run text
>> extraction in a background process so that a possible OOM error or
>> even a JVM crash won't affect your application.
>>
>> BR,
>>
>> Jukka Zitting
> Thanks Jukka :-)
>
> The biggest file is ~95 MB; most are under 1 MB.
>
> I set the VirtualBox VM's memory back to 1024 MB and tried to
> reproduce the problem but the behaviour has changed.  On the
> collection of documents used for development and testing, for the
> types that omindex uses Tika 0.8 to convert to text:
>
> doc files:  tried: 134, failed: 60  44.77%
> docx files: tried:   1, failed:  0
> odp files:  tried:   1, failed:  0
> ods files:  tried:  23, failed:  0
> odt files:  tried:  71, failed:  0
> pdf files:  tried:  81, failed: 81 100.00%
> ppt files:  tried:   4, failed:  4 100.00%
> rtf files:  tried:   2, failed:  2 100.00%
> xls files:  tried:  27, failed: 27 100.00%
>
> The java.lang.OutOfMemoryError message no longer appears.  Some runs now
> generate std::bad_alloc messages but most simply say "Aborted" (I don't
> know whether that message is from omindex or Tika).
>
> omindex monitors Tika's behaviour (return code?).  When it detects a
> failure it logs the failing Tika command.  Running a sample of those
> commands at the command prompt does not produce the errors.  It is
> beginning to look as if the problem is caused by the environment that
> omindex sets up for Tika to run in.  There is a several year old
> omindex bug report (link not to hand) about omindex restricting memory
> for its filter programs.
>
> I'll take this to the Xapian+Omega mailing list (omindex is part of
> Omega).
>
> Best
>
> Charles
>
>
Hello :-)

Update. 

This is not an issue for us because we can work around it by increasing
memory; it only happens on the development VirtualBox VM when it is
configured with its usual 1024 MB of memory (the workaround is to raise
that to 3072 MB).  But I understand the symptoms illustrate a potentially
serious issue, hence this message.  I'm out of my knowledge-zone here,
knowing very little about Java, so this is as best I understand it ...

Although Xapian's omindex does restrict memory usage by the commands it
uses to extract text for indexing, those commands "ought" to behave well
if they run out of memory.  In the case of a Java app such as Tika, the
JVM should do so, too.
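To see whether the JVM copes with an externally imposed limit at all, one
can start it under a tight address-space cap from the shell.  This is a
sketch; the 256 MB figure is just an assumed example, not the limit
omindex actually applies:

```shell
# Start the JVM under a tight address-space cap, inside a subshell so
# the limit does not stick to the interactive shell.  The 256 MB value
# is an assumed example, not omindex's actual limit.
(
    ulimit -v 262144    # virtual memory limit, in KB
    java -version
    echo "java exit status: $?"
)
```

A well-behaved JVM should either start or fail with a clear error and a
non-zero exit status, rather than crash.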

Here is an extract from the Xapian mailing list, with Olly (one of the
main Xapian developers, I believe) responding to my message.  The full
text is at
http://permalink.gmane.org/gmane.comp.search.xapian.general/8892.

===== extract begins =====

These are starting to sound like they're mostly Tika and/or Java issues,
though maybe it's an issue with how we set the resource limits, or we
might need to provide a workaround if Java doesn't handle such limits
well.

> I installed Tika 0.9 and tried again with the VirtualBox VM memory at  
> 1024 MB.
>
> Ran omindex several times, running "rm /var/lib/omega/data/docoll/*"
> before each run.  The output varied, including zero to two messages from
> glibc such as these samples gathered over around 20 runs:
>
> *** glibc detected *** java: double free or corruption (!prev): 0x0000000000642b40 ***
> *** glibc detected *** java: free(): invalid pointer: 0x000000000242e460 ***
> *** glibc detected *** java: double free or corruption (!prev): 0x0000000001697b40 ***
> *** glibc detected *** java: double free or corruption (fasttop): 0x0000000000b33d50 ***
> *** glibc detected *** java: double free or corruption (!prev): 0x0000000000f03b30 ***
> *** glibc detected *** java: free(): invalid pointer: 0x0000000000dfe440 ***

Those sound like bugs in the JVM - potentially serious ones since double
free can lead to security vulnerabilities:

http://cwe.mitre.org/data/definitions/415.html

I guess it isn't handling running out of memory gracefully.

I don't know much about how JVMs set their memory limits by default (I
know you can specify on the command line), but perhaps the JVM is
looking at the limits omindex sets and basing decisions on these?

You could try disabling or changing omindex's limits - see runfilter.cc
for where the limit is set and freemem.cc for where the amount of
available memory is determined.

Or come at it from the other end and set resource limits similar to
those omindex is setting when running Tika from the shell.

===== extract ends =====
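Taking up Olly's last suggestion, a rough way to mimic omindex's
environment from the shell is to impose a similar limit before each Tika
run.  A sketch: the /srv/docoll path is the one indexed below, but the
512 MB cap is an assumed figure; runfilter.cc would have the real one.

```shell
#!/bin/sh
# Re-run Tika over the collection under an explicit resource limit,
# roughly imitating the environment omindex gives its filter commands.
# The 512 MB cap is an assumed figure, not omindex's computed limit.
for file in /srv/docoll/*; do
    echo "$file"
    (
        ulimit -v 524288    # virtual memory limit, in KB (assumed value)
        java -jar /opt/apache/tika/tika-app-0.9.jar --text "$file" > /dev/null
    ) || echo "FAILED with status $?: $file"
done
```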

This is the omindex command, to show how Tika is called (the Tika
invocation is the part after the colon in each --filter argument):
omindex --db /var/lib/omega/data/docoll/ \
  --filter 'application/msword:java -jar /opt/apache/tika/tika-app-0.9.jar --text' \
  --filter 'application/octet-stream:strings -n8' \
  --filter 'application/pdf:java -jar /opt/apache/tika/tika-app-0.9.jar --text' \
  --filter 'application/vnd.ms-excel:java -jar /opt/apache/tika/tika-app-0.9.jar --text' \
  --filter 'application/vnd.ms-powerpoint:java -jar /opt/apache/tika/tika-app-0.9.jar --text' \
  --filter 'application/x-gzip:java -jar /opt/apache/tika/tika-app-0.9.jar --text' \
  --filter 'application/xml:java -jar /opt/apache/tika/tika-app-0.9.jar --text' \
  --filter 'application/x-rar:java -jar /opt/apache/tika/tika-app-0.9.jar --text' \
  --filter 'application/x-zip:java -jar /opt/apache/tika/tika-app-0.9.jar --text' \
  --filter 'text/plain:cat' \
  --filter 'text/rtf:java -jar /opt/apache/tika/tika-app-0.9.jar --text' \
  --filter 'text/x-c:strings -n8' \
  --filter 'text/x-c++:strings -n8' \
  --stemmer=english \
  --url / \
  /srv/docoll/
        
This is the Java version being used on the development Debian Squeeze
64-bit system running in a VM:

c@CW8vDS:~$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)

Best

Charles


