I apologize; I took a closer look. I guess it's a matter of interpretation as to what the detector should be doing: in your example, Tika detected the format that matches each file name extension, but those copies you made weren't really PowerPoint or Excel files. If you run your test again with the -m option, the Content-Type field should show different results than what you see with --detect, and these are arguably better.

I have a particular use case in mind where file names aren't necessarily to be trusted, so maybe it's for the best that the detector can return a different result than the -m option. If this occurs, a user might know that the file extension is suspect, or the software developer using Tika could take steps to rename the file to its correct extension or make a copy with a correct extension.

I can drop the issue at this point; I just wanted to see whether anyone thought the behavior of --detect was obviously incorrect.
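
To make the "suspect extension" idea concrete, here is a rough sketch of the comparison I have in mind. It does not call Tika at all: expected_type and check_extension are hypothetical helpers of my own, the extension table only covers the three types from this thread, and the second argument is assumed to be whatever content-based type you obtained (for example, the Content-Type value reported by -m).

```shell
#!/bin/sh
# Hypothetical helper: map a file name's extension to the MIME type it implies.
# Only the three types mentioned in this thread are covered.
expected_type() {
    case "$1" in
        *.doc) echo "application/msword" ;;
        *.xls) echo "application/vnd.ms-excel" ;;
        *.ppt) echo "application/vnd.ms-powerpoint" ;;
        *)     echo "unknown" ;;
    esac
}

# Hypothetical helper: compare the extension-implied type ($1 is the file
# name) with a content-based type ($2, assumed to come from elsewhere).
# Prints "ok" when they agree and "suspect" when they differ.
check_extension() {
    if [ "$(expected_type "$1")" = "$2" ]; then
        echo "ok"
    else
        echo "suspect"
    fi
}

check_extension C1.doc application/msword
check_extension C1.doc application/vnd.ms-powerpoint
```

A caller could use the "suspect" result to decide whether to rename the file or make a copy with the extension that matches its content.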

On Sun, Nov 20, 2011 at 4:43 PM, Nick Burch <[email protected]> wrote:
> On Sun, 20 Nov 2011, John M wrote:
>>
>> With genuine .doc, .xls, or .ppt files, I'm not having a problem. I
>> was wondering how good Tika was about being fooled with misnamed
>> files, and so I took a .ppt, and just changed the extension to a .doc
>> to see what would occur. Using the -m option turns out to be better
>> than -d in this case.
>
> Please take another look at my example. I took a .doc, renamed it, and Tika
> detected it just fine for me, hence my wondering why it is different for you
>
> Nick
>
>> On Sun, Nov 20, 2011 at 4:14 PM, Nick Burch <[email protected]>
>> wrote:
>>>
>>> On Sun, 20 Nov 2011, John M wrote:
>>>>
>>>> I'm using a build from the 1.1 source.
>>>
>>> That's odd - with 1.1 TikaCLI will use DefaultDetector, which loads all
>>> available detectors including the container aware ones
>>>
>>> However, I'm not able to reproduce your problem:
>>>
>>> cd /tmp
>>> cp ~/test.doc C1.doc
>>> cp ~/test.doc C1.xls
>>> cp ~/test.doc C1.ppt
>>> cd ~/java/apache-tika/tika-app/target
>>> for i in /tmp/C1*; do echo ""; echo $i; java -jar tika-app-1.1-SNAPSHOT.jar --detect $i; done
>>>
>>> /tmp/C1.doc
>>> application/msword
>>>
>>> /tmp/C1.ppt
>>> application/vnd.ms-powerpoint
>>>
>>> /tmp/C1.xls
>>> application/vnd.ms-excel
>>>
>>>
>>> So I do get the container aware detection working properly. Not sure
>>> what's not working for you....
>>>
>>> Nick
