On Sun, 20 Nov 2011, John M wrote:
I apologize; I took a closer look. I guess it's a matter of interpretation as to what the detector should be doing: in your example, Tika detected the format based on the file name extensions, but those copies you made weren't really PowerPoint or Excel files.
Ah, oops. More coffee needed! You're right, I wasn't seeing what I was expecting - the file should come back as a .doc no matter the filename, on the grounds that the content trumps the name.
If you look at the TestMediaTypes class you'll see what you can get with just the mime magic and filenames, and then there's TestContainerAwareDetector, which shows the correct detection happening by using the extra detectors available.
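
To illustrate why magic plus filename isn't enough here, below is a rough standalone sketch (plain Java, not Tika's actual code; the class and method names are made up for illustration). It assumes only the well-known 8-byte OLE2 header, which .doc, .xls and .ppt files all share. The magic can tell you the file is an OLE2 container, but not which Office format is inside, so a detector working from magic and name alone ends up trusting the extension. Telling the formats apart needs a look inside the container, which is roughly the job of the container-aware detection.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration only; none of these names come from Tika.
public class DetectionSketch {
    // The first eight bytes of any OLE2 compound document.
    // .doc, .xls and .ppt all start with this same signature.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // True if the leading bytes match the OLE2 signature.
    static boolean looksLikeOle2(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Guess a media type from the extension alone (no content inspected).
    static String guessFromName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".doc")) return "application/msword";
        if (lower.endsWith(".xls")) return "application/vnd.ms-excel";
        if (lower.endsWith(".ppt")) return "application/vnd.ms-powerpoint";
        return "application/octet-stream";
    }

    // Magic + name only: the magic confirms "some OLE2 file", and the
    // extension then decides the specific type, right or wrong.
    static String detectWithoutContainer(byte[] header, String fileName) {
        if (looksLikeOle2(header)) {
            return guessFromName(fileName);
        }
        return "application/octet-stream";
    }
}
```

With this sketch, a Word document copied to a name ending in .ppt passes the magic check and is then reported as PowerPoint, which is the misdetection discussed above.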
Any chance you could open a bug for this? You're correct, and it really is a bug.
Thanks Nick
