I'm trying to parse a directory full of .eml files, many of which have
attachments. Even though I use -J, I'm not seeing any OCR results from
the attachments, and I'm not seeing anything extracted from PDFs either.
Finally, tika-app does not recognize several command-line flags.

I'm running Ubuntu 18.04 with openjdk-8 (1.181) and the latest Maven
(3.5.4). I also have libtesseract-dev and tesseract-ocr-all installed on
my machine.

I downloaded Tika 1.18 and ran mvn clean install. The build completes fine
and I see the tika-app jar at ~/tika-1.18/tika-app/target/tika-app-1.18.jar

I am able to run java -jar ~pathto/tika-app-1.18.jar -J -i
/mydirectoryoffiles/ -o /mytikaoutput/ and it works fine.

I am not able to pass any other flags to tika-app, though (for example, -r).
I'm not able to pass -z to extract attachments either.

I get output like this:
"INFO  about to start driver
BatchProcess:No config file set via -bc, relying on
tika-app-batch-config.xml or default-tika-batch-config.xml
INFO  BatchProcess: org.apache.commons.cli.UnrecognizedOptionException:
Unrecognized option: -z"

Can anyone tell me how I can parse a directory of .eml files and extract
the data from their attachments?
