Thank you, Jukka! I am not that familiar with SAX events and can't figure out where the XHTML document gets returned to. I haven't had time to dig deeper into this yet, but an answer to the following question would speed up my understanding.
The API documentation for ForkParser seems to say that parse is a void method and doesn't return anything, so I'm assuming it populates one of the other objects that gets passed to the parse method. Is this the case, and if so, which one?

Thank you again for your quick response to both this question and on stackoverflow (which I had posted and you answered and referenced). I will definitely file a tika improvement issue right now!

Regards,
-Arthur Meneau

On Dec 2, 2011, at 1:50 AM, Jukka Zitting wrote:

> Hi,
>
> On Thu, Dec 1, 2011 at 11:57 PM, Arthur Meneau <[email protected]> wrote:
>> We are primarily using Tika to detect the content-type of files.
>
> Instead of using a full parser for this, I suggest using the detect()
> methods of the Tika facade [1]. Those are typically much faster and
> memory-efficient.
>
> See my answer on StackOverflow [2] for the reason why the ForkParser
> currently doesn't return extracted metadata. Quoting:
>
> "The ForkParser class in Tika 1.0 unfortunately does not support
> metadata extraction since for now the communication channel to the
> forked parser process only supports passing back SAX events but not
> metadata entries. I suggest you file a TIKA improvement issue to get
> this fixed.
>
> One workaround you might want to consider is getting the extracted
> metadata from the <meta> tags in the <head> section of the XHTML
> document returned by the forked parser. Those should be available and
> contain most of the metadata entries normally returned in the Metadata
> object."
>
> [1] http://tika.apache.org/1.0/api/org/apache/tika/Tika.html
> [2] http://stackoverflow.com/questions/8349898/why-is-my-tika-metadata-object-not-being-populated-when-using-forkparser/8354392#8354392
>
> BR,
>
> Jukka Zitting
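
For reference, here is a rough sketch of the <meta> workaround quoted above, written against plain JDK SAX classes only. It assumes the XHTML arrives as SAX events on whatever ContentHandler is passed to parse, and that each metadata entry shows up as a <meta name="..." content="..."/> element. MetaTagCollector and collect are hypothetical names, the sample XHTML is made up, and the JDK's own SAX parser stands in for the forked parser here, so treat this as an illustration rather than the way Tika does it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects <meta name="..." content="..."/> pairs
// from a stream of SAX events.
class MetaTagCollector extends DefaultHandler {
    private final Map<String, String> entries = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Prefer the local name; fall back to the qualified name if it is empty.
        String element = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("meta".equals(element)) {
            String name = attribute(atts, "name");
            String content = attribute(atts, "content");
            if (name != null && content != null) {
                entries.put(name, content);
            }
        }
    }

    // Looks an attribute up by (empty namespace, local name), then by qualified name.
    private static String attribute(Attributes atts, String name) {
        String value = atts.getValue("", name);
        return value != null ? value : atts.getValue(name);
    }

    Map<String, String> getEntries() {
        return entries;
    }

    // Stand-in for the forked parser: feeds an XHTML string through the JDK SAX parser.
    static Map<String, String> collect(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        MetaTagCollector handler = new MetaTagCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getEntries();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document, shaped like the <head> section described above.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<meta name=\"Content-Type\" content=\"application/pdf\"/>"
                + "<title>sample</title></head><body/></html>";
        System.out.println(collect(xhtml));
    }
}
```

If that assumption holds, an instance of such a handler would be passed as the ContentHandler argument of parse and read back with getEntries() once parse returns.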
