Hi Jake,
In reverse order...

1) Command flags: right, sorry, we've only implemented text/metadata
extraction in batch mode (triggered by -i and -o).  The -z option
currently only operates on one file at a time.
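
As a stopgap until -z works in batch mode, you could drive tika-app
once per file from a small script. This is only a rough sketch in
Python: the function names and the one-folder-per-message layout are
my own assumptions, and it relies on the -z and --extract-dir options
of tika-app, so check them against --help on your build.

```python
import subprocess
from pathlib import Path


def build_extract_jobs(jar, input_dir, output_root):
    """Return (target_dir, argv) pairs, one per .eml file under input_dir."""
    jobs = []
    for eml in sorted(Path(input_dir).rglob("*.eml")):
        # One output folder per message keeps attachments apart
        # (assumes the .eml file names are unique across subfolders).
        target = Path(output_root) / eml.stem
        argv = ["java", "-jar", str(jar),
                "-z", f"--extract-dir={target}", str(eml)]
        jobs.append((target, argv))
    return jobs


def run_extract_jobs(jobs):
    """Run each tika-app command in turn; return the inputs that failed."""
    failed = []
    for target, argv in jobs:
        target.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(argv)
        if result.returncode != 0:
            failed.append(argv[-1])
    return failed
```

Calling run_extract_jobs(build_extract_jobs(jar, "/mydirectoryoffiles/",
"/mytikaoutput/")) starts one JVM per message, which is slow but keeps
each extraction isolated.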

2) "Even though I use -J, I'm not seeing the results of OCR on the
attachments" ... When you type 'tesseract' at the command line, does
that kick off tesseract, or is it not on your path?  Do you have a
custom installation?  If you run tika-app.jar -J against a single file
with an attachment that should be OCR'd, what values are you getting
for X-Parsed-By?  To help isolate whether tesseract is being called
at all, try running standalone tika-app.jar -J against, e.g.,
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testOCR.docx
or
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testOCR.pdf
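
If it helps, the two checks above can be scripted. A minimal Python
sketch follows; the function names are hypothetical, and it assumes
that -J prints a JSON array with one metadata object per document or
attachment and that the OCR parser shows up in X-Parsed-By under a
class name ending in TesseractOCRParser.

```python
import json
import shutil


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def parsers_used(tika_json_text):
    """Collect every X-Parsed-By value from `tika-app -J` output."""
    found = set()
    for entry in json.loads(tika_json_text):
        value = entry.get("X-Parsed-By", [])
        # A single value may be serialized as a bare string, not a list.
        if isinstance(value, str):
            value = [value]
        found.update(value)
    return found


def ocr_was_called(tika_json_text):
    """True if any document or attachment was handled by the OCR parser."""
    return any(name.endswith("TesseractOCRParser")
               for name in parsers_used(tika_json_text))
```

If find_tesseract() returns None, Tika most likely cannot see the
binary either; if it returns a path but ocr_was_called() is False for
testOCR.docx, the problem is probably on the Tika side.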

3) "I'm also not seeing anything extracted from PDFs" -- are the PDFs
image-only, or do they actually contain text?  If image-only, figuring
out whether tesseract is being called at all might solve the problem,
but also see
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR for
how to use a tika-config to turn on the extraction/OCR'ing of inline
images in PDFs.
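
For reference, here is roughly what that looks like, wrapped in a
small Python helper so the config can be written out and passed to
tika-app in one go. Treat it as a sketch: the XML is modeled on the
wiki page, the extractInlineImages parameter name should be checked
against your Tika version, and the helper names are made up.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Replace the default PDFParser with one that extracts inline images,
# so that they can be handed on to the OCR parser.
PDF_OCR_CONFIG = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def write_pdf_ocr_config(path):
    """Write the config to `path` after checking that it is well-formed."""
    ET.fromstring(PDF_OCR_CONFIG)  # raises ParseError if the XML is broken
    Path(path).write_text(PDF_OCR_CONFIG, encoding="utf-8")
    return Path(path)


def build_command(jar, config_path, input_file):
    """Return the tika-app argv that uses the config with -J."""
    return ["java", "-jar", str(jar),
            f"--config={config_path}", "-J", str(input_file)]
```

Extracting inline images can be slow on PDFs that contain many small
images, so it is worth trying this on a handful of files first.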

On Mon, Aug 27, 2018 at 3:56 PM Jake Burns <[email protected]> wrote:
>
> I'm trying to parse a directory full of .eml files (and many have 
> attachments). Even though I use -J, I'm not seeing the results of OCR on the 
> attachments. I'm also not seeing anything extracted from PDFs. Finally, 
> tika-app is not recognizing a bunch of command flags.
>
> I'm running ubuntu 18.04 and have openjdk-8 (1.181) installed with the latest 
> maven (3.5.4).
> I've also got the libtesseract-dev and tesseract-OCR-all installed on my 
> machine.
>
> I downloaded Tika 1.18 and ran mvn clean install.  The build completes fine 
> and I see the tika-app jar at ~/tika-1.18/tika-app-target/tika-app-1.18.jar
>
> I am able to run java -jar ~pathto/tika-app-1.18.jar -J -i 
> /mydirectoryoffiles/ -o /mytikaoutput/ and it works alright.
>
> I am not able to pass any other flags to tika though. for example -r.
> I'm not able to pass -z to extract attachments either.
>
> I get stuff like this:
> "INFO  about to start driver
> BatchProcess:No config file set via -bc, relying on tika-app-batch-config.xml 
> or default-tika-batch-config.xml
> INFO  BatchProcess: org.apache.commons.cli.UnrecognizedOptionException: 
> Unrecognized option: -z"
>
> Can anyone tell me how I can parse a directory of .eml files and extract the 
> data from their attachments?
