I'm trying to get Tika to detect .bat and .cmd files. Both are detected as text/plain.
In tika-mimetypes.xml (https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml), .bat falls under application/x-msdownload, yet the file is detected as text/plain. Surprisingly, .cmd is listed under text/plain; I would have expected it to be grouped with .bat.

Has anyone had Tika properly detect batch script files? The closest thing I can find when searching for this is this unresolved ticket: https://issues.apache.org/jira/browse/TIKA-1148

When I run the tika-app jar by itself, I get the same result (text/plain) as when I do this through Java code.

    > java -jar tika-app-1.16.jar -d BatchInstall.bat
    Aug 23, 2017 9:40:22 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
    See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
    TIFFImageWriter not loaded. tiff files will not be processed
    See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
    J2KImageReader not loaded. JPEG2000 files will not be processed.
    See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
    Aug 23, 2017 9:40:23 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    WARNING: org.xerial's sqlite-jdbc is not loaded.
    Please provide the jar on your classpath to parse sqlite files.
    See tika-parsers/pom.xml for the correct version.
    text/plain

Java version:

    private static final Tika CONTENT_TYPE_DETECTOR = new Tika();

    return CONTENT_TYPE_DETECTOR.detect(fileItem.get(), fileItem.getName()); // Returns text/plain
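For reference, the behaviour I'm after could be approximated by a small wrapper around whatever the detector returns. This is only a sketch of a possible workaround, not anything Tika provides: the class name `BatchTypeFallback`, the method `refine`, and the label `application/x-bat` are all made up for illustration. It takes the detected type as a plain string, so it runs without Tika on the classpath.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: overrides a text/plain result for batch script extensions.
final class BatchTypeFallback {
    // Assumed label for batch scripts; chosen for illustration only.
    static final String BATCH_TYPE = "application/x-bat";

    private static final Set<String> BATCH_EXTENSIONS =
            new HashSet<>(Arrays.asList("bat", "cmd"));

    private BatchTypeFallback() {}

    // Returns BATCH_TYPE when the detector said text/plain and the file name
    // ends in .bat or .cmd (case-insensitive); otherwise returns the input type.
    static String refine(String detectedType, String fileName) {
        if (fileName == null || !"text/plain".equals(detectedType)) {
            return detectedType;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return detectedType; // no extension to inspect
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BATCH_EXTENSIONS.contains(ext) ? BATCH_TYPE : detectedType;
    }

    public static void main(String[] args) {
        System.out.println(refine("text/plain", "BatchInstall.bat")); // application/x-bat
        System.out.println(refine("text/plain", "notes.txt"));        // text/plain
    }
}
```

In my code this would wrap the existing call, e.g. `BatchTypeFallback.refine(CONTENT_TYPE_DETECTOR.detect(fileItem.get(), fileItem.getName()), fileItem.getName())`, but I would rather have Tika report the right type itself.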
