Thanks, I guess I'll refrain from using extra flags on the command line.

I think most of the files Tika fails to parse are failing because of TesseractOCR timeouts.

If I run:

java -jar ~/tika-1.18/tika-app/target/tika-app-1.18.jar -J -t -i /mailin/ -o /mailout/

A lot of my messages have timeouts like this where the X-TIKA:content field should be:

X-TIKA:EXCEPTION:embedded_exception":"org.apache.tika.exception.TikaException: TesseractOCRParser timeout\n\tat org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:560)\n\tat org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:432)\n\tat org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:286)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)\n\tat org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat org.apache.tika.parser.mail.MailContentHandler.handleEmbedded(MailContentHandler.java:283)\n\tat org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:228)\n\tat org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)\n\tat org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:100)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)\n\tat org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:159)\n\tat 
org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:406)\n\tat org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)\n\tat org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:181)\n\tat org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)\n\tat org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: java.util.concurrent.TimeoutException\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:205)\n\tat org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:549)\n\t... 32 more\n","X-TIKA:digest:MD5":"79171517bfedab52b24bd1691a5ff544","X-TIKA:embedded_resource_path":"/CastleBrooks Ulana.jpg
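
To check how many embedded documents actually hit this timeout, the output directory could be tallied with a small script. This is only a sketch: it assumes each file written by -J -t is a JSON array of metadata objects, it matches on the "TesseractOCRParser timeout" text from the trace above, and the function name tally_embedded_exceptions is made up for illustration.

```python
import json
from collections import Counter
from pathlib import Path

# Metadata key seen in the output above.
EXC_KEY = "X-TIKA:EXCEPTION:embedded_exception"


def tally_embedded_exceptions(out_dir):
    """Count outcomes per metadata object across the .json files in out_dir."""
    counts = Counter()
    for path in Path(out_dir).rglob("*.json"):
        try:
            entries = json.loads(path.read_text(encoding="utf-8"))
        except ValueError:
            # Covers empty (0-byte) files and anything that is not valid JSON.
            counts["unreadable_file"] += 1
            continue
        if isinstance(entries, dict):
            # Tolerate a single object instead of the assumed array.
            entries = [entries]
        for entry in entries:
            exc = entry.get(EXC_KEY, "")
            if "TesseractOCRParser timeout" in exc:
                counts["ocr_timeout"] += 1
            elif exc:
                counts["other_embedded_exception"] += 1
            else:
                counts["ok"] += 1
    return counts
```

Calling tally_embedded_exceptions("/mailout/") returns a Counter keyed by "ok", "ocr_timeout", "other_embedded_exception" and "unreadable_file", which gives a rough idea of how much of the failure rate comes from OCR timeouts.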


Sometimes Tika writes no content at all: the output is just a 0-byte filename.eml.json. That happens when I run:

java -Xmx12g -Xms12g -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true -jar ~/tika-1.18/tika-app/target/tika-app-1.18.jar -J -t -i /mailin/ -o /mailout/
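
To find which inputs ended up as 0-byte outputs, something like the following could be used. It is a sketch that assumes the output tree mirrors the input tree with ".json" appended to each file name (as in filename.eml.json); the function name empty_outputs is hypothetical.

```python
from pathlib import Path


def empty_outputs(out_dir, in_dir):
    """Return the input paths whose corresponding .json output is 0 bytes."""
    out_root, in_root = Path(out_dir), Path(in_dir)
    inputs = []
    for path in sorted(out_root.rglob("*.json")):
        if path.stat().st_size == 0:
            rel = path.relative_to(out_root)
            # "sub/filename.eml.json" -> "sub/filename.eml"
            inputs.append(in_root / rel.with_suffix(""))
    return inputs
```

For example, empty_outputs("/mailout/", "/mailin/") gives a list of .eml paths that could then be re-run one at a time to see what went wrong.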

Sometimes the Tika processing also just grinds to a halt with an illegalIOexception.

TL;DR -

I'm running 24 CPU Cores with 64 GB of RAM on SSDs.

With a directory of 100,000 .eml files (many with attachments), is there a recommended way to parallelize or do batch parsing reliably?


On 08/28/2018 07:54 AM, Tim Allison wrote:
Hi Jake,
In reverse order...

1) command flags:  right, sorry, we've only implemented text/metadata
extraction via batch-mode (triggered by -i and -o).  The -z option
currently only operates one file at a time.

2) "Even though I use -J, I'm not seeing the results of OCR on the
attachments" ... when you type 'tesseract' at the command line, does
that kick off tesseract, or is it not on your path? Do you have a
custom installation?  If you run tika-app.jar -J against a single file
with an attachment that should be OCR'd, what values are you getting
for X-ParsedBy? To help isolate whether tesseract is being called
at all, try running standalone tika-app.jar -J against, e.g.
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testOCR.docx
or 
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testOCR.pdf

3) "I'm also not seeing anything extracted from PDFs" -- are the PDFs
image only or do they actually contain text?  If image only, once we
figure out whether tesseract is being called at all, that might solve
the problem, but also see:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR for
how to use a tika-config to turn on the extraction/OCR'ing of inline
images in PDFs.

On Mon, Aug 27, 2018 at 3:56 PM Jake Burns wrote:
I'm trying to parse a directory full of .eml files (and many have attachments). 
Even though I use -J, I'm not seeing the results of OCR on the attachments. I'm 
also not seeing anything extracted from PDFs. Finally, tika-app is not 
recognizing a bunch of command flags.

I'm running Ubuntu 18.04 and have openjdk-8 (1.181) installed with the latest 
maven (3.5.4).
I've also got libtesseract-dev and tesseract-ocr-all installed on my 
machine.

I downloaded Tika 1.18 and ran mvn clean install.  The build completes fine and 
I see the tika-app jar at ~/tika-1.18/tika-app/target/tika-app-1.18.jar

I am able to run java -jar ~pathto/tika-app-1.18.jar -J -i /mydirectoryoffiles/ 
-o /mytikaoutput/ and it works alright.

I am not able to pass any other flags to Tika though, for example -r.
I'm not able to pass -z to extract attachments either.

I get stuff like this:
"INFO  about to start driver
BatchProcess:No config file set via -bc, relying on tika-app-batch-config.xml 
or default-tika-batch-config.xml
INFO  BatchProcess: org.apache.commons.cli.UnrecognizedOptionException: Unrecognized 
option: -z"

Can anyone tell me how I can parse a directory of .eml files and extract the 
data from their attachments?
