Constraining Tika's memory usage (using ForkParser possibly?)

Arthur Meneau Thu, 01 Dec 2011 14:57:53 -0800

I’m having an issue with Tika’s memory usage when parsing files. The core 
constraint is that Tika must not be allowed to consume all the available 
memory on the system that does our file parsing and metadata extraction. We 
are primarily using Tika to detect the content-type of files. The other 
metadata properties that Tika can extract are also valuable to us, but less 
essential.


I’ve already been down several paths looking for a simple way to resolve this 
issue. I’m catching Throwable to prevent an OutOfMemoryError from taking down 
our app. However, that still leaves a period of time during which Tika is 
using all available memory and could cause slowdowns in other parts of our 
application. It is an acceptable interim solution, but it needs to be replaced 
as soon as a better one is available.

I also looked into using ForkParser instead of the AutoDetectParser. ForkParser 
would be an ideal solution, since it would let us constrain the forked JVM’s 
memory. However, I cannot get it to collect metadata correctly. In fact, it 
doesn’t seem to extract any metadata at all (not even the content-type!), even 
though the associated content handler (BodyContentHandler) is non-null after 
the parse method returns.
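
To make the idea of constraining a forked JVM's memory concrete, here is a minimal sketch that uses only the JDK and no Tika classes. It is not how ForkParser is implemented; it only illustrates that a child JVM started with -Xmx has its own heap cap, independent of the parent's. The class name HeapCappedChild and its methods are hypothetical, and the sketch assumes Java 7 or later (for inheritIO) and that the child should use the same java binary as the parent.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeapCappedChild {

    /** Builds the command line for a child JVM whose heap is capped at maxHeapMb megabytes. */
    static List<String> buildCommand(int maxHeapMb, String... jvmArgs) {
        // Assumption: reuse the java binary of the JVM running this program.
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> command = new ArrayList<String>();
        command.add(java);
        // The heap limit applies to the child process only.
        command.add("-Xmx" + maxHeapMb + "m");
        command.addAll(Arrays.asList(jvmArgs));
        return command;
    }

    /** Starts the child process, waits for it to finish, and returns its exit code. */
    static int run(List<String> command) throws Exception {
        Process child = new ProcessBuilder(command)
                .inheritIO() // forward the child's output to this process's console
                .start();
        return child.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // "-version" stands in for a real main class; it just proves the child starts.
        int exit = run(buildCommand(64, "-version"));
        System.out.println("child exited with " + exit);
    }
}
```

If the child exceeds its cap, the OutOfMemoryError occurs in the child process, so the parent application's heap is not affected.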

I peeked inside the source and noticed that the test for ForkParser never 
actually verifies that the metadata object is populated. I'm not sure whether 
that is intentional, but as I mentioned, the metadata object is empty (aside 
from the RESOURCE_NAME_KEY that I set on the metadata object myself, as seen 
below).

Do you have any recommendations for how I might constrain how much memory Tika 
consumes? I feel like ForkParser is the right track, but I'm open to changing 
this if there's another, better option for extracting metadata. Also, I'm not 
interested in what ends up in the BodyContentHandler; I'm really only 
interested in the metadata from the file.

I’m using Tika 1.0’s ForkParser in the following fashion:
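        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import org.apache.tika.fork.ForkParser;
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.sax.BodyContentHandler;

        // "logger" is a logger field defined elsewhere in the enclosing class.
        public static Metadata getMetadata(File f) {
                Metadata metadata = new Metadata();
                ForkParser parser = new ForkParser();
                InputStream fis = null;
                try {
                        fis = new FileInputStream(f);
                        // -1 disables the write limit on the extracted body text
                        BodyContentHandler contentHandler = new BodyContentHandler(-1);
                        ParseContext context = new ParseContext();

                        // Command used to start the forked JVM, intended to cap its heap at 64 MB.
                        // NOTE: as written, -cp is not followed by a classpath value, so the
                        // java launcher would read -Xmx64m as the classpath, not as a heap limit.
                        parser.setJavaCommand("/usr/local/java6/bin/java -cp -Xmx64m");
                        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());

                        parser.parse(fis, contentHandler, metadata, context);

                        String contentType = metadata.get(Metadata.CONTENT_TYPE);

                        logger.error("contentType: " + contentType);
                        logger.error("contentHandler: " + contentHandler.toString());
                        logger.error("metadata: " + metadata.toString());
                } catch (Throwable e) {
                        logger.error("Exception while analyzing file\n" +
                        "CAUTION: metadata may still have useful content in it!\n" +
                        "Exception: " + e, e);
                } finally {
                        // Release the input stream and the forked process even on failure
                        if (fis != null) {
                                try { fis.close(); } catch (IOException ignored) { }
                        }
                        parser.close();
                }
                // Returned on both paths: after a failure it may be only partially populated
                return metadata;
        }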

Thanks!
-Arthur Meneau
