Greetings... I am new to Tika and I am trying to detect the
internal document format of an OOXML container/file.
When I call detect(InputStream, String) on a new Tika() instance, it
appears I can fool the detector(s) by changing the file extension of a
docx file to xlsx... the detection returns
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Since the code comments use the word 'hint' to describe the use of
resource names during detection, I was hoping that the hint itself was
taken lightly, as advisory only.
Our application accepts a very limited set of file extensions, and we
have to expect that some users will solve any conundrums about file
formats by renaming their files to meet the requirements.
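To make concrete what I mean by the internal format, here is a rough
sketch using only the JDK zip classes. It is not Tika code and I am not
claiming this is how Tika's detectors work. The class name OoxmlSniffer
and its method are my own invention for illustration, and the check is
a crude substring match on the main part's content type declared in
[Content_Types].xml. It ignores the file name completely.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika: names the OOXML flavour from the
// package contents and never looks at the file name.
final class OoxmlSniffer {
    static final String PREFIX = "application/vnd.openxmlformats-officedocument.";
    static final String DOCX = PREFIX + "wordprocessingml.document";
    static final String XLSX = PREFIX + "spreadsheetml.sheet";
    static final String PPTX = PREFIX + "presentationml.presentation";

    // Returns DOCX, XLSX or PPTX, or null if no known main part is declared.
    static String sniff(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"[Content_Types].xml".equals(entry.getName())) {
                    continue;
                }
                // Read only this entry; read() returns -1 at the entry's end.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                String xml = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                // Crude substring match on the main part's declared content type.
                for (String type : new String[] {DOCX, XLSX, PPTX}) {
                    if (xml.contains(type + ".main+xml")) {
                        return type;
                    }
                }
                return null;
            }
        }
        return null;
    }
}
```

Calling OoxmlSniffer.sniff(new FileInputStream(file)) on a docx that
was renamed to .xlsx should still report the wordprocessingml type,
which is the behaviour I was hoping to get from detect(). The sketch
does not cover macro-enabled or template variants, which declare
different content types.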
I think I have all the jars (including transitive deps) on the
classpath so that the more rigorous detection can take place... I've
gone through the list of jars in the 1.0 gettingstarted.html doc twice to
make sure they are all listed in the Eclipse classpath... I just
don't know if what I am seeing is consistent with missing jars or not.
I've done some debugging and see a very long list of Magics, but, again,
I don't know if that is core or not... should I see a long list of
detectors as well?
Any help offered would be appreciated.
--
Jon Gorrono
PGP Key: 0x5434509D -
http{pgp.mit.edu:11371/pks/lookup?search=0x5434509D&op=index}
GSWoT Introducer - {GSWoT:US75 5434509D Jon P. Gorrono <jpgorrono -
www.gswot.org>}
http{middleware.ucdavis.edu}