Jukka, Doh! I had been using the BodyContentHandler and did not realize that the content handler is where the XHTML document gets sent, with one caveat: BodyContentHandler only passes along what is inside <body>, so the <meta> tags in <head> never show up. You need a different type of content handler, so I'm using the ToXMLContentHandler and that seems to be doing the trick!
I knew I was missing something simple, I have resolved this issue. Thanks again for your help.

-Arthur

On Dec 2, 2011, at 6:15 PM, Arthur Meneau wrote:

> Thank you Jukka!
>
> I am not that familiar with SAX events and can't seem to figure out where the
> XHTML document gets returned to. I haven't had enough time to dig deeper
> into this quite yet, but the following questions would help speed up my
> understanding.
>
> The API documentation for ForkParser seems to say that parse is a void method
> and doesn't return anything, so I'm assuming it populates one of the other
> objects that gets passed to the parse method. Is this the case, and if so,
> which one?
>
> Thank you again for your quick response to both this question and on
> stackoverflow (which I had posted and you answered and referenced). I will
> definitely file a tika improvement issue right now!
>
> Regards,
> -Arthur Meneau
>
> On Dec 2, 2011, at 1:50 AM, Jukka Zitting wrote:
>
>> Hi,
>>
>> On Thu, Dec 1, 2011 at 11:57 PM, Arthur Meneau <[email protected]> wrote:
>>> We are primarily using Tika to detect the content-type of files.
>>
>> Instead of using a full parser for this, I suggest using the detect()
>> methods of the Tika facade [1]. Those are typically much faster and
>> memory-efficient.
>>
>> See my answer on StackOverflow [2] for the reason why the ForkParser
>> currently doesn't return extracted metadata. Quoting:
>>
>> "The ForkParser class in Tika 1.0 unfortunately does not support
>> metadata extraction since for now the communication channel to the
>> forked parser process only supports passing back SAX events but not
>> metadata entries. I suggest you file a TIKA improvement issue to get
>> this fixed.
>>
>> One workaround you might want to consider is getting the extracted
>> metadata from the <meta> tags in the <head> section of the XHTML
>> document returned by the forked parser. Those should be available and
>> contain most of the metadata entries normally returned in the Metadata
>> object."
>>
>> [1] http://tika.apache.org/1.0/api/org/apache/tika/Tika.html
>> [2]
>> http://stackoverflow.com/questions/8349898/why-is-my-tika-metadata-object-not-being-populated-when-using-forkparser/8354392#8354392
>>
>> BR,
>>
>> Jukka Zitting
>
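
For anyone reading this thread later, the workaround quoted above (reading metadata from the <meta> tags in the <head> of the XHTML output) might look roughly like the sketch below. It uses only the JDK's SAX classes, so it is not Tika code: the class name MetaTagCollector and its collect() helper are hypothetical, and it assumes the XHTML carries metadata as <meta name="..." content="..."/> elements. In principle a handler like this could be passed as the ContentHandler argument to the parse method, or run over the string produced by ToXMLContentHandler, but that wiring is an assumption and is not shown here.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects <meta name="..." content="..."/> pairs
// found inside <head> from a stream of SAX events.
public class MetaTagCollector extends DefaultHandler {
    private final Map<String, String> metadata = new LinkedHashMap<>();
    private boolean inHead = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back to qName when the event source is not namespace-aware.
        String name = localName.isEmpty() ? qName : localName;
        if ("head".equals(name)) {
            inHead = true;
        } else if (inHead && "meta".equals(name)) {
            String key = atts.getValue("", "name");
            String value = atts.getValue("", "content");
            if (key != null && value != null) {
                metadata.put(key, value);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if ("head".equals(name)) {
            inHead = false;
        }
    }

    public Map<String, String> getMetadata() {
        return metadata;
    }

    // Convenience method: parse an XHTML string and return the collected entries.
    public static Map<String, String> collect(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        MetaTagCollector collector = new MetaTagCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getMetadata();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document standing in for the parser's XHTML output.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<meta name=\"Content-Type\" content=\"application/pdf\"/>"
                + "<meta name=\"Author\" content=\"Example Author\"/>"
                + "<title>sample</title></head>"
                + "<body><p>text</p></body></html>";
        System.out.println(collect(xhtml));
    }
}
```

Note that this only sees the <head> if the handler receives the whole document, which is why BodyContentHandler would not work here.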
