Re: Constraining Tika's memory usage (using ForkParser possibly?)

Arthur Meneau Mon, 05 Dec 2011 10:12:55 -0800

Jukka,

Doh! I had been using the BodyContentHandler and did not realize that the 
content handler is where the XHTML document gets sent (with a caveat: 
BodyContentHandler only keeps what is inside <body>, so the <meta> tags in 
<head> never show up).  You must use a different type of content handler, so 
I'm using the ToXMLContentHandler and that seems to be doing the trick!
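
For anyone who finds this thread later, here is a minimal sketch of the 
<meta>-tag workaround Jukka describes. It assumes the string passed in is the 
XHTML you get back from ToXMLContentHandler.toString() after the forked parse 
has finished. The class name MetaTagExtractor and the sample document are made 
up for illustration, and only JDK SAX classes are used, so treat it as a 
starting point rather than the way Tika itself does it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: pulls name/content pairs out of the
// <meta> elements of an XHTML document.
class MetaTagExtractor {

    /** Returns the metadata entries found in <meta name="..." content="..."/>. */
    static Map<String, String> extract(String xhtml) throws Exception {
        final Map<String, String> metadata = new LinkedHashMap<>();

        SAXParserFactory factory = SAXParserFactory.newInstance();
        // The XHTML carries a default namespace, so match on the local name.
        factory.setNamespaceAware(true);

        factory.newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (!"meta".equals(localName)) {
                            return;
                        }
                        String name = atts.getValue("name");
                        String content = atts.getValue("content");
                        // Skip <meta> elements that lack either attribute.
                        if (name != null && content != null) {
                            metadata.put(name, content);
                        }
                    }
                });
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input; in practice this would be the handler's output.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<meta name=\"Content-Type\" content=\"application/pdf\"/>"
                + "<title>Sample</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extract(xhtml));
    }
}
```

If a document repeats a metadata name, this sketch keeps only the last value; 
switching to a Map<String, List<String>> would preserve all of them.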


I knew I was missing something simple. This issue is now resolved.

Thanks again for your help.
-Arthur

On Dec 2, 2011, at 6:15 PM, Arthur Meneau wrote:

> Thank you Jukka!
> 
> I am not that familiar with SAX events and can't seem to figure out where the 
> XHTML document gets returned to.  I haven't had enough time to dig deeper 
> into this quite yet, but the following questions would help speed up my 
> understanding.
> 
> The API documentation for ForkParser seems to say that parse is a void method 
> and doesn't return anything, so I'm assuming it populates one of the other 
> objects that gets passed to the parse method. Is this the case, and if so, 
> which one?
> 
> Thank you again for your quick response to both this question and the one I 
> posted on StackOverflow (which you answered and referenced).  I will 
> definitely file a TIKA improvement issue right now!
> 
> Regards,
> -Arthur Meneau
> 
> On Dec 2, 2011, at 1:50 AM, Jukka Zitting wrote:
> 
>> Hi,
>> 
>> On Thu, Dec 1, 2011 at 11:57 PM, Arthur Meneau <[email protected]> wrote:
>>> We are primarily using Tika to detect the content-type of files.
>> 
>> Instead of using a full parser for this, I suggest using the detect()
>> methods of the Tika facade [1]. Those are typically much faster and
>> memory-efficient.
>> 
>> See my answer on StackOverflow [2] for the reason why the ForkParser
>> currently doesn't return extracted metadata. Quoting:
>> 
>> "The ForkParser class in Tika 1.0 unfortunately does not support
>> metadata extraction since for now the communication channel to the
>> forked parser process only supports passing back SAX events but not
>> metadata entries. I suggest you file a TIKA improvement issue to get
>> this fixed.
>> 
>> One workaround you might want to consider is getting the extracted
>> metadata from the <meta> tags in the <head> section of the XHTML
>> document returned by the forked parser. Those should be available and
>> contain most of the metadata entries normally returned in the Metadata
>> object."
>> 
>> [1] http://tika.apache.org/1.0/api/org/apache/tika/Tika.html
>> [2] http://stackoverflow.com/questions/8349898/why-is-my-tika-metadata-object-not-being-populated-when-using-forkparser/8354392#8354392
>> 
>> BR,
>> 
>> Jukka Zitting
>
