Hi,

On Thu, Dec 1, 2011 at 11:57 PM, Arthur Meneau <[email protected]> wrote:
> We are primarily using Tika to detect the content-type of files.

Instead of using a full parser for this, I suggest using the detect() methods of the Tika facade [1]. Those are typically much faster and more memory-efficient.

See my answer on StackOverflow [2] for the reason why the ForkParser currently doesn't return extracted metadata. Quoting:

"The ForkParser class in Tika 1.0 unfortunately does not support metadata extraction since for now the communication channel to the forked parser process only supports passing back SAX events but not metadata entries. I suggest you file a TIKA improvement issue to get this fixed. One workaround you might want to consider is getting the extracted metadata from the <meta> tags in the <head> section of the XHTML document returned by the forked parser. Those should be available and contain most of the metadata entries normally returned in the Metadata object."

[1] http://tika.apache.org/1.0/api/org/apache/tika/Tika.html
[2] http://stackoverflow.com/questions/8349898/why-is-my-tika-metadata-object-not-being-populated-when-using-forkparser/8354392#8354392

BR,

Jukka Zitting
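
As a rough illustration of the <meta> workaround described above, here is a minimal sketch of a SAX handler that collects name/content pairs from <meta> elements. It uses only the JDK's SAX classes. The class name MetaTagCollector and its collect() helper are hypothetical and not part of Tika, and the sketch assumes the metadata arrives as <meta name="..." content="..."/> elements, as the quoted answer suggests.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records the name/content pairs of
// every <meta> element seen in a stream of XHTML SAX events.
class MetaTagCollector extends DefaultHandler {
    private final Map<String, String> entries = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // localName is empty when the event source is not namespace-aware,
        // so fall back to the qualified name in that case.
        String element = localName.isEmpty() ? qName : localName;
        if ("meta".equals(element)) {
            String name = attributes.getValue("name");
            String content = attributes.getValue("content");
            if (name != null && content != null) {
                // Simplification: if a name repeats, the last value wins.
                entries.put(name, content);
            }
        }
    }

    Map<String, String> getEntries() {
        return entries;
    }

    // Stand-alone demonstration: feeds an XHTML string through the JDK SAX
    // parser instead of a forked Tika parser.
    static Map<String, String> collect(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            MetaTagCollector collector = new MetaTagCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.getEntries();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<meta name=\"Content-Type\" content=\"application/pdf\"/>"
                + "<title>sample</title></head><body><p>text</p></body></html>";
        System.out.println(collect(xhtml));
    }
}
```

Since the forked parser passes back SAX events, a handler along these lines could in principle be supplied as the ContentHandler argument of the parse() call, after which getEntries() would hold whatever <meta> entries were emitted. That wiring is an assumption and has not been checked against Tika 1.0.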
