Thanks very much for the informative answer. I had a look at the MIME types list and there are 50 different "Office" formats, including many for Microsoft Word/Excel/PowerPoint!
Is there any recommended strategy for reliably detecting the correct media type of such files, in order to use POI afterwards, for content extraction? In other words, would you use Tika for detection and POI for extraction in such a scenario?

On Fri, Jan 27, 2012 at 1:55 PM, Nick Burch <[email protected]> wrote:

> On Fri, 27 Jan 2012, Public Network Services wrote:
>
>> More specifically, for the 3 basic MS-Office formats, I am now getting:
>>
>> - For *docx*: application/vnd.openxmlformats-officedocument.wordprocessingml.document
>> - For *xlsx*: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>> - For *pptx*: vnd.openxmlformats-officedocument.presentationml.presentation
>> - For *doc*: application/msword
>> - For *xls*: application/vnd.ms-excel
>> - For *ppt*: application/vnd.ms-powerpoint
>
> Those are all correct
>
>> Is that the only answer possible, or could there be another type returned
>> for, say, Word?
>
> If memory serves, some of the office templates can have different
> mimetypes to the normal file itself, and the macro enabled forms usually
> have different mimetypes to the non-macro version.
>
>> Where do I find all the Tika type declarations and names?
>
> The base set come from org/apache/tika/mime/tika-mimetypes.xml
>
> The latest version of that can be seen in SVN at:
> https://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
>
> Additionally, you can add extra custom mimetypes if you want, details are at
> http://tika.apache.org/1.0/parser_guide.html#Add_your_MIME-Type
>
> Nick
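
To make the detect-then-dispatch idea in my question concrete, here is a rough sketch of the dispatch half only. It assumes the media type string has already been obtained from a detector (for example Tika), and it maps the six types listed above to the POI component that handles each format. The class name MediaTypeRouter and the method poiComponentFor are hypothetical names made up for illustration, and the pptx entry uses the full type name with the "application/" prefix. Templates and macro-enabled variants, which may report other types, are deliberately not covered.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not part of Tika or POI): routes a detected media type
// to the name of the POI component that reads that format.
final class MediaTypeRouter {

    // The six basic MS Office media types from the thread, mapped to POI components.
    private static final Map<String, String> POI_COMPONENT = Map.of(
        "application/msword", "HWPF",
        "application/vnd.ms-excel", "HSSF",
        "application/vnd.ms-powerpoint", "HSLF",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "XWPF",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "XSSF",
        "application/vnd.openxmlformats-officedocument.presentationml.presentation", "XSLF"
    );

    // Returns the POI component for a media type, or empty if the type is not in the table.
    static Optional<String> poiComponentFor(String mediaType) {
        if (mediaType == null) {
            return Optional.empty();
        }
        // Drop any parameters (e.g. "; charset=...") and normalise case before the lookup.
        String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(POI_COMPONENT.get(base));
    }

    public static void main(String[] args) {
        // Assumption: in a real pipeline this string would come from the detector,
        // e.g. new org.apache.tika.Tika().detect(file), rather than being hard-coded.
        String detected = "application/vnd.ms-excel";
        System.out.println(detected + " -> "
            + poiComponentFor(detected).orElse("no POI component in this table"));
    }
}
```

Anything not in the table comes back empty, so unknown or variant types can be logged or handed to a fallback instead of being passed to the wrong extractor.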
