I’m having an issue with Tika’s memory usage when parsing files. The problem
really stems from the fact that Tika cannot be allowed to eat up all the
available memory on the system that will be doing file parsing and metadata
extraction. We are primarily using Tika to detect the content-type of files.
The other metadata properties that Tika is able to extract are also valuable to
us but not as necessary.
I’ve been down several paths already to see if there would be a simple way to
resolve this issue. I’m catching Throwable to prevent an OutOfMemoryError from
taking down our app, but there is still a period of time where Tika is using
all available memory and could cause slowdowns in other parts of our
application. This is an acceptable interim solution, but it needs to be
replaced with a better one as soon as one is available.
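The interim workaround described above can be sketched without Tika as a small wrapper that treats an OutOfMemoryError like any other failure. This is only an illustration under assumptions: GuardedCall and callWithFallback are hypothetical names, not part of Tika, and the caveat above still applies, since the heap has already been exhausted by the time the error is caught.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Tika: runs a task and returns a fallback
// value if the task fails or the JVM runs out of heap while running it.
class GuardedCall {

    static <T> T callWithFallback(Callable<T> task, T fallback) {
        try {
            return task.call();
        } catch (OutOfMemoryError e) {
            // The heap was already exhausted before we got here, so other
            // threads may have been slowed down or affected in the meantime.
            return fallback;
        } catch (Exception e) {
            // Ordinary failures are handled the same way in this sketch.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Simulate a parse step that exhausts memory.
        String result = callWithFallback(new Callable<String>() {
            public String call() {
                throw new OutOfMemoryError("simulated");
            }
        }, "unknown");
        System.out.println(result);
    }
}
```

Catching OutOfMemoryError explicitly, rather than Throwable, keeps other serious errors visible, but it does nothing to limit how much memory the task uses before it fails.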
I also looked into using ForkParser instead of the AutoDetectParser. ForkParser
would be an ideal solution, allowing us to constrain the forked JVM’s memory.
However, I cannot seem to get it to collect metadata correctly. In fact, it
doesn’t seem to be extracting any metadata at all (not even the content-type!),
even though the associated content handler (BodyContentHandler) is non-null
after calling the parse method.
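The general idea of capping a child JVM's heap can be sketched with the standard ProcessBuilder, independent of Tika. This is not how ForkParser is implemented; ChildJvmSketch, buildCommand, newBuilder, and worker.jar are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Tika code: prepares a child JVM whose heap is
// capped with -Xmx, so a runaway task cannot exhaust the parent's memory.
class ChildJvmSketch {

    // Builds the argument list; every option is its own list element.
    static List<String> buildCommand(String javaBinary, int maxHeapMb, String jarPath) {
        if (maxHeapMb <= 0) {
            throw new IllegalArgumentException("maxHeapMb must be positive");
        }
        List<String> command = new ArrayList<String>();
        command.add(javaBinary);
        command.add("-Xmx" + maxHeapMb + "m"); // JVM options go before -jar
        command.add("-jar");
        command.add(jarPath);
        return command;
    }

    // Prepares, but does not start, the child process.
    static ProcessBuilder newBuilder(String javaBinary, int maxHeapMb, String jarPath) {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(javaBinary, maxHeapMb, jarPath));
        builder.redirectErrorStream(true); // merge stderr into stdout
        return builder;
    }

    public static void main(String[] args) {
        // "worker.jar" is a placeholder; nothing is launched here.
        System.out.println(newBuilder("java", 64, "worker.jar").command());
    }
}
```

Option order matters in a command like this: the java launcher treats the token after -cp as the class path, so a heap flag placed directly after -cp would not be read as a heap setting.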
I peeked inside the source and noticed that the test for ForkParser never
actually verifies that the metadata object is populated. I'm not sure if that
is intentional or not, but as I mentioned, the metadata object is empty (aside
from the RESOURCE_NAME_KEY that is set on the metadata object, as seen below).
Do you have any recommendations for how I might constrain how much memory Tika
consumes? I feel like ForkParser is the right track, but I'm open to changing
this if there's another, better option for extracting metadata. Also, as far
as I can tell, I don't need what ends up in the BodyContentHandler; I'm really
only interested in the metadata from the file.
I’m using Tika 1.0’s ForkParser in the following fashion:

    public static Metadata getMetadata(File f) {
        Metadata metadata = new Metadata();
        try {
            FileInputStream fis = new FileInputStream(f);
            BodyContentHandler contentHandler = new BodyContentHandler(-1);
            ParseContext context = new ParseContext();
            ForkParser parser = new ForkParser();
            parser.setJavaCommand("/usr/local/java6/bin/java -cp -Xmx64m");
            metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
            parser.parse(fis, contentHandler, metadata, context);
            fis.close();
            String contentType = metadata.get(Metadata.CONTENT_TYPE);
            logger.error("contentHandler: " + contentHandler.toString());
            logger.error("metadata: " + metadata.toString());
            return metadata;
        } catch (Throwable e) {
            logger.error("Exception while analyzing file\n" +
                    "CAUTION: metadata may still have useful content in it!\n" +
                    "Exception: " + e, e);
            return metadata;
        }
    }

Thanks!
-Arthur Meneau