content detection problem using tika-app

John M Sun, 20 Nov 2011 11:28:06 -0800

Hello,

I have a .ppt file that I've renamed to be a .doc file (by only
changing its extension).  If I use the Tika GUI, or the command line,
to extract the file metadata, then Tika correctly identifies the
content type as a PowerPoint file.  However, if I use the command line
-d option to detect its content type, the application returns
"application/msword", which is of course only superficially correct.
The source code indicates that the correct type comes from a call to a
parser's parse method, while the less-accurate detection comes from a
call to a detector's detect method.  I'm not sure if this is a feature
or a bug; I didn't see anything similar when browsing through JIRA.
Before I or someone else creates a bug report or feature request
there, I thought I'd ask whether the project team is aware that the
detector is less accurate than the parser at identifying content
types.
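
To make the suspected cause concrete, here is a minimal sketch of how a
detector that combines magic bytes with a file-name hint could end up
reporting "application/msword" for a renamed .ppt.  It is not Tika's
code: the class and method names are hypothetical, and the assumption
that the file name is used to refine a generic OLE2 match is my own
reading of the behaviour.  What is certain is that legacy .doc and
.ppt files share the same 8-byte OLE2 container signature, so the
leading bytes alone cannot tell them apart.

```java
import java.util.Arrays;
import java.util.Locale;

// Illustrative sketch only; hypothetical names, NOT Tika's implementation.
final class Ole2DetectionSketch {

    // Signature shared by every OLE2 compound document (.doc, .ppt, .xls, ...).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the header starts with the OLE2 container signature.
    static boolean isOle2(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Assumed behaviour: the magic bytes only establish "some OLE2 file",
    // and the extension is then trusted to pick the specific type.
    static String detect(byte[] header, String fileName) {
        if (!isOle2(header)) {
            return "application/octet-stream";
        }
        String name = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".doc")) return "application/msword";
        if (name.endsWith(".ppt")) return "application/vnd.ms-powerpoint";
        if (name.endsWith(".xls")) return "application/vnd.ms-excel";
        // Generic OLE2 container type (assumed name for the fallback).
        return "application/x-tika-msoffice";
    }
}
```

With this sketch, the same header bytes passed with the name
"slides.ppt" yield "application/vnd.ms-powerpoint", while "slides.doc"
yields "application/msword".  A parser, by contrast, opens the
container and can see which streams it actually holds, which would
explain why the parse path reports the right type.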


Thanks,
John Mastarone
