I'm trying to parse a directory full of .eml files, many of which have attachments. Even though I use -J, I'm not seeing any OCR results for the attachments, and nothing is being extracted from PDFs either. On top of that, tika-app does not recognize a number of command-line flags.
I'm running Ubuntu 18.04 with openjdk-8 (1.181) and the latest Maven (3.5.4). I also have libtesseract-dev and tesseract-ocr-all installed. I downloaded Tika 1.18 and ran mvn clean install. The build completes fine and I see the tika-app jar at ~/tika-1.18/tika-app/target/tika-app-1.18.jar

I am able to run the following, and it works alright:

java -jar ~pathto/tika-app-1.18.jar -J -i /mydirectoryoffiles/ -o /mytikaoutput/

However, I am not able to pass any other flags to Tika in this mode, for example -r. I'm not able to pass -z to extract attachments either. I get output like this:

INFO about to start driver BatchProcess:No config file set via -bc, relying on tika-app-batch-config.xml or default-tika-batch-config.xml
INFO BatchProcess: org.apache.commons.cli.UnrecognizedOptionException: Unrecognized option: -z

Can anyone tell me how I can parse a directory of .eml files and extract the data from their attachments?
