Re: Constraining Tika's memory usage (using ForkParser possibly?)

Arthur Meneau Fri, 02 Dec 2011 18:15:44 -0800

Thank you Jukka!

I am not that familiar with SAX events and can't figure out where the XHTML 
document gets returned. I haven't had time to dig deeper into this yet, but 
an answer to the following question would speed up my understanding.


The API documentation for ForkParser says that parse is a void method, so I'm 
assuming it populates one of the objects passed to it. Is this the case, and 
if so, which one?

Thank you again for your quick response, both here and on Stack Overflow 
(where you answered the question I had posted). I will file a Tika 
improvement issue right away!

Regards,
-Arthur Meneau

On Dec 2, 2011, at 1:50 AM, Jukka Zitting wrote:

> Hi,
> 
> On Thu, Dec 1, 2011 at 11:57 PM, Arthur Meneau <[email protected]> wrote:
>> We are primarily using Tika to detect the content-type of files.
> 
> Instead of using a full parser for this, I suggest using the detect()
> methods of the Tika facade [1]. Those are typically much faster and
> memory-efficient.
> 
> See my answer on StackOverflow [2] for the reason why the ForkParser
> currently doesn't return extracted metadata. Quoting:
> 
> "The ForkParser class in Tika 1.0 unfortunately does not support
> metadata extraction since for now the communication channel to the
> forked parser process only supports passing back SAX events but not
> metadata entries. I suggest you file a TIKA improvement issue to get
> this fixed.
> 
> One workaround you might want to consider is getting the extracted
> metadata from the <meta> tags in the <head> section of the XHTML
> document returned by the forked parser. Those should be available and
> contain most of the metadata entries normally returned in the Metadata
> object."
> 
> [1] http://tika.apache.org/1.0/api/org/apache/tika/Tika.html
> [2] http://stackoverflow.com/questions/8349898/why-is-my-tika-metadata-object-not-being-populated-when-using-forkparser/8354392#8354392
> 
> BR,
> 
> Jukka Zitting
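
A note on the question above: Tika's Parser.parse method takes an InputStream,
a ContentHandler, a Metadata object and a ParseContext, and the XHTML output is
delivered as SAX events to the ContentHandler. The following is a minimal
sketch of the workaround Jukka describes, assuming the forked parser emits
<meta name="..." content="..."/> elements in the XHTML namespace. The class
name MetaTagCollector and the sample document are hypothetical. To keep the
sketch runnable without Tika, the JDK's own SAX parser stands in for the
ForkParser as the source of events.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records every <meta name=... content=...> it sees.
public class MetaTagCollector extends DefaultHandler {
    private final Map<String, String> meta = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Fall back to qName in case the event source is not namespace-aware.
        String element = (localName == null || localName.isEmpty())
                ? qName : localName;
        if ("meta".equals(element)) {
            // Assumption: attributes carry no namespace (URI is "").
            String name = atts.getValue("", "name");
            String content = atts.getValue("", "content");
            if (name != null && content != null) {
                meta.put(name, content);
            }
        }
    }

    public Map<String, String> getMeta() {
        return meta;
    }

    // Stand-in for the forked parser: feeds SAX events from an XHTML string.
    public static Map<String, String> collect(String xhtml) throws Exception {
        MetaTagCollector handler = new MetaTagCollector();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getMeta();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample of the kind of XHTML a parser might produce.
        String xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
          + "<meta name=\"Content-Type\" content=\"application/pdf\"/>"
          + "<meta name=\"title\" content=\"Sample\"/>"
          + "</head><body><p>text</p></body></html>";
        System.out.println(collect(xhtml));
    }
}
```

With Tika on the classpath, the same handler instance would presumably be
passed as the ContentHandler argument of ForkParser.parse(...), and getMeta()
read once parse returns.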
