Re: Constraining Tika's memory usage (using ForkParser possibly?)

Jukka Zitting Fri, 02 Dec 2011 01:51:24 -0800

Hi,

On Thu, Dec 1, 2011 at 11:57 PM, Arthur Meneau <[email protected]> wrote:
> We are primarily using Tika to detect the content-type of files.


Instead of running a full parser for this, I suggest using the detect()
methods of the Tika facade [1]. They are typically much faster and more
memory-efficient than a full parse.
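
A minimal sketch of that approach, assuming tika-core is on the
classpath. The class and method names (DetectType, detectByName,
detectByContent) are made up for illustration; only the Tika facade
and its detect() overloads come from the Tika API [1].

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;

// Hypothetical helper class; not part of Tika itself.
class DetectType {

    // One facade instance, created once and reused for every call.
    private static final Tika TIKA = new Tika();

    // Detection from the file name alone; no file content is read.
    static String detectByName(String fileName) {
        return TIKA.detect(fileName);
    }

    // Detection from the file's content and name, without a full parse.
    static String detectByContent(File file) throws IOException {
        return TIKA.detect(file);
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            System.out.println(path + ": " + detectByContent(new File(path)));
        }
    }
}
```

Neither call extracts text or metadata, which is why this should need
far less memory than running a parser.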

See my answer on StackOverflow [2] for the reason why the ForkParser
currently doesn't return extracted metadata. Quoting:

"The ForkParser class in Tika 1.0 unfortunately does not support
metadata extraction since for now the communication channel to the
forked parser process only supports passing back SAX events but not
metadata entries. I suggest you file a TIKA improvement issue to get
this fixed.

One workaround you might want to consider is getting the extracted
metadata from the <meta> tags in the <head> section of the XHTML
document returned by the forked parser. Those should be available and
contain most of the metadata entries normally returned in the Metadata
object."
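
The workaround quoted above could look roughly like the sketch below.
It assumes you have already captured the XHTML produced by the forked
parser as a String (for example by serializing the SAX events
yourself), and that metadata appears as <meta name="..." content="..."/>
elements. MetaTagExtractor and extractMeta are hypothetical names, and
the code uses only the JDK's SAX parser, not any Tika API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; not part of Tika itself.
class MetaTagExtractor {

    // Collects name/content pairs from every <meta> element in the XHTML.
    static Map<String, String> extractMeta(String xhtml) {
        final Map<String, String> meta = new LinkedHashMap<String, String>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("meta".equals(localName)) {
                    // Unprefixed attributes live in the empty namespace.
                    String name = atts.getValue("", "name");
                    String content = atts.getValue("", "content");
                    if (name != null && content != null) {
                        meta.put(name, content);
                    }
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Namespace awareness makes localName reliable for XHTML input.
            factory.setNamespaceAware(true);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
        return meta;
    }
}
```

If a metadata name occurs more than once, this sketch keeps only the
last value; use a Map<String, List<String>> if you need all of them.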

[1] http://tika.apache.org/1.0/api/org/apache/tika/Tika.html
[2] http://stackoverflow.com/questions/8349898/why-is-my-tika-metadata-object-not-being-populated-when-using-forkparser/8354392#8354392

BR,

Jukka Zitting
